Merge pull request #4 from jonas-fuchs/dev

Reworked primer search and introduced primer dimer sensitive amplicon tiling
jonas-fuchs · Mar 25, 2023 · bf122eb · bf122eb
2 parents 8018f7c + 4747dc2
commit bf122eb
Show file tree

Hide file tree

Showing 18 changed files with 1,156 additions and 1,007 deletions.
diff --git a/README.md b/README.md
@@ -1,22 +1,20 @@
+**var**iable **V**irus**AMP**licons (varVAMP) is a tool to design primers for highly diverse viruses. The input is an alignment of your viral (full-genome) sequences.
+
 # varVAMP
 
 [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
 
-**var**iable **V**irus**AMP**licons is a tool to design primers for highly diverse viruses. The input is an alignment of your viral (full-genome) sequences.
+For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority.
 
 **varVAMP comes in three different flavors:**
 
 <img src="./docs/varvamp.png" alt="varVAMP logo" />
 
-**TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing.
-
 **SANGER** *(coming soon)*: varVAMP searches for the very best primers and reports back an amplicon which can be used for PCR-based screening approaches.
 
-**QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers.
+**TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing.
 
-Features that are planned, but have not been implemented yet:
-* Dimer search for amplicon schemes (currently only primer homo-dimers and hetero-dimers within the amplicon are evaluated).
-* Primer BLAST.
+**QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers.
 
 This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate!
 
@@ -27,11 +25,12 @@ This program is currently being developed and in an alpha state. You are welcome
 * [Usage](docs/usage.md)
 * [Output](docs/output.md)
 * [How it works](docs/how_varvamp_works.md)
+* [FAQ](docs/FAQ.md)
 
 ---
 
 **Important disclaimer:**
-*For the primer design, varVAMP uses [primer3](https://pypi.org/project/primer3-py/) to check if digested kmers of a sequence are potential primers. The functions for this were adapted from [primalscheme](www.github.com/aresti/primalscheme) and I do not claim credit.*
+*For the primer design, varVAMP uses [primer3](https://pypi.org/project/primer3-py/) to check if digested kmers of a sequence are potential primers. Some of the functions for this were adapted from [primalscheme](www.github.com/aresti/primalscheme) and I do not claim credit.*
 
 *The remaing code is under the GPLv3 licence. The code is WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.*
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -0,0 +1,23 @@
+## FAQ
+
+1. **How do I set settings for varVAMP?**
+
+Start with setting your optimal length of your amplicon, the max length that you can tolerate and the overlap that you want to achieve. If you want to allow ambiguous characters, set them to 2 as a start. Set the threshold to the mean sequence identity of your alignment. Run varvamp and then optimize the output.
+
+2. **How do I optimize the output?**
+
+It all depends on how many conserved regions varVAMP is able to find! There are two main parameters that influence this. The number of ambiguous bases allowed within a primer and the threshold for consensus nucleotides. Setting the threshold higher or the number of ambiguous bases lower will result in less conserved regions. If you have set the parameters below and get a decent output, increase the threshold until the output gets worse. This will increase the specificity of your primers. Likewise, if you do not have a good output, consider increasing the number of ambiguous bases before you lower the threshold. The console output varVAMP will also give you some suggestions.
+
+3. **varVAMP reported primer dimers. What now?**
+
+In your case varVAMP could not find suitable replacement primers. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification.
+
+
+4. **How fast is varVAMP?**
+
+varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER are faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned.
+
+
+
+
+#### [Previous: How it works](./how_varvamp_works.md)
diff --git a/docs/how_varvamp_works.md b/docs/how_varvamp_works.md
@@ -19,26 +19,22 @@ varVAMP creates consensus sequence on the basis of the threshold. If a nucleotid
 varVAMP searches for conserved regions as defined by a user-defined number of ambiguous bases that is allowed within the minimal length of a primer. The algorithm opens windows over the ambiguous consensus sequence and determines if a window satisfies these constraints.
 
 ### Primer search
-varVAMP uses primer3-py to search for potential primers. The evaluation if primers match certain criteria was adapted from [primalscheme](www.github.com/aresti/primalscheme). The primer search contains multiple steps:
-1. Digest the conserved regions in kmers with the min and max length of primers.
-2. Evaluate if these kmers are potential primers (temperature etc) and score their base penalty.
-3. Hardfilter kmers that do not satisfy the max base penalty constraint and number of allowed ambiguous characters at the 3'end.
-4. Filter primers with the same start for the primer with the lowest score.
-5. Score the 3' penalty and permutation penalty.
-6. Fin lowest scoring primer in a window. This  window is defined by the end of the first primer in the window + the max primer length. This greatly reduces the complexity while retaining well scoring primers.
+varVAMP uses primer3-py to search for potential primers. Some of the evaluation if primers match certain criteria was adapted from [primalscheme](www.github.com/aresti/primalscheme). The primer search contains multiple steps:
+1. Digest the conserved regions into kmers with the min and max length of primers. This is performed on a consensus sequence that does not contain ambiguous characters but is just the majority consensus of the alignment. Therefore, primer parameters will be later calculated for the best fitting primer.
+2. Evaluate if these kmers are potential primers direction independent (temp, gc, size, poly-x repeats and poly dinucleotide repeats) and direction dependent (secondary structure, gc clamp, number of GCs in the last 5 bases of the 3'end and min 3' nucleotides without a ambiguous base). Filter for kmers that satisfy all constraints and calculate their penalties (explained in the last section).
+3. Find lowest scoring primer. varVAMP sorts the primers by their score and always takes the best scoring if the primer positions have not been covered by a better primer. This greatly reduces the complexity of the later amplicon search while only retaining the best scoring primer of a set of overlapping primers.
 
 ### Amplicon search
 
 #### Amplicon-tiling
 To search for the best scoring amplicon, varVAMP uses a graph based approach.
-1. Create all possible amplicons with the given length constraints.
-2. Create a graph for amplicons that satisfy the overlap constraints. The cost to go to the next node is determined by the amplicon score.
-3. Use the dijkstra algorithm to find the shortest paths from a given start node.
+1. Create all possible amplicons with the given length constraints and ensure that primer pairs are not forming dimers.
+2. Create a graph containing all amplicons and their potential neighboring amplicons. To design a good scheme, the next primer has to lie within the second half of the current primer and satisfy the overlap constraint. The cost to go to a neighboring amplicon is determined by the amplicon score.
+3. Use the dijkstra algorithm to find the path with the lowest costs from a given start node.
 4. Determine potential stop nodes as those amplicons that have the furthest stop in the alignment.
 5. Determine shortest paths between the start and all potential stop nodes. Get the lowest scoring.
-6. Repeat this process for each start node until the best coverage is reached.
-
-This then is the best scoring amplicon scheme!
+6. Repeat steps 3-5 for each start node until the best coverage is reached. This then is the best scoring amplicon scheme!
+7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers.
 
 #### Sanger sequencing
 coming soon
@@ -51,18 +47,18 @@ coming soon
 ```python
 PRIMER_MAX_BASE_PENALTY
 ```
-Each primer is scored for its deviation from the optimums. Base penealty is the cummulative penalty of penalties for temperature, gc and size. Primer base penalties higher than the max base penalty are hardfiltered.
+Each primer is scored for its deviation from the optimal size, gc content and temperature. Base penalty is the sum of these deviations. Primer base penalties higher than the max base penalty are excluded.
 
 ```python
 PRIMER_3_PENALTY
 ```
-Each position in the primer is scored for mismatches in all sequences. If a 3' penalty is given the first in the tuple is multiplied with the freq mismatch at the very 3' end. The next is multiplied with the -1 freq and so on. Increase penalty if you want to shift amplicons torwards best 3' matching. set to 0 if you do not care about 3' mismatches.
+Each position in the primer is scored for mismatches in all sequences. If a 3' penalty is given the first in the tuple is multiplied with the freq mismatch at the very 3' end. The next is multiplied with the -1 freq and so on. Increase penalty if you want to shift amplicons towards best 3' matching. set to 0 if you do not care about 3' mismatches.
 
 ```python3
 PRIMER_PERMUTATION_PENALTY
 ```
 The number permutations of a primer is multiplied by the penalty. For example 24 permutations and a penalty of 0.1 will yield a penalty of 2.4. Set to 0 if you do not care about the number of permutations.
 
-In the end all scores of a primer are summed up and yield a final score. The score for each amplicon is then the score of its LEFT + RIGHT primers multiplied by the fold increase of the amplicon length comapred to the optional length. This insures that in the final scheme not only large amplicons are used.
+All scores of a primer are summed up and yield a final score. The score for each amplicon is then the score of its LEFT + RIGHT primers multiplied by the fold increase of the amplicon length compared to the optional length. This insures that in the final scheme not only large amplicons are used.
 
-#### [Previous: Output](./output.md)
+#### [Previous: Output](./output.md)&emsp;&emsp;[Next: FAQ](./FAQ.md)
diff --git a/docs/installation.md b/docs/installation.md
@@ -1,34 +1,37 @@
 ## Requirements
 varVAMP runs on UNIX Systems, MacOSX and Windows with Python3 >=3.9 installed.
 
-
-## Installation of varvamp with pip
+## Installation
 
 ```shell
-git clone https://github.com/jonas-fuchs/varVAMP
-cd varVAMP
-pip install .
+pip install varvamp
 ```
 That was already it. To check if it worked:
 ```shell
 varvamp -v
 ```
 You should see the current varVAMP version.
 
-## Installation via requirements.txt for development
+## Installation for advanced customization or development
+
+All varVAMP options (such as temperature, size, penalties) can be customized in the config.py. However, to do this you will have to install varvamp not from the PyPI repository, but directly from this github repository.
+
+### - via pip (recommended)
 
 ```shell
 git clone https://github.com/jonas-fuchs/varVAMP
 cd varVAMP
-pip install -r requirements.txt
-python3 varvamp -v
+pip install .
 ```
 
-## Update
+### - via requirements.txt
+
 ```shell
+git clone https://github.com/jonas-fuchs/varVAMP
 cd varVAMP
-git pull
-pip install .
+pip install -r requirements.txt
+python3 varvamp -v
 ```
 
+
 #### [Next: Preparing your data](./preparing_the_data.md)
diff --git a/docs/output.md b/docs/output.md
@@ -11,7 +11,8 @@ varVAMP produces multiple main output files:
 | per_base_mismatches.pdf | Barplot of the percent mismatches at each nucleotide position of the primer. |
 | primer_to_amplicon_assignments.tabular | Simple tab separated file, which primers belong together. Useful for bioinformatic workflows that include primer trimming |
 | primers.bed | A bed file with the primer locations. Includes the primer score. The lower, the better. |
-| primer.tsv | A tab separated file with important parameters for the primers including the sequence. |
+| primer.tsv | A tab separated file with important parameters for the primers including the sequence with ambiguous nucleotides (already in the right strand) and the gc and temperature of the best fitting primer as well as for the mean for all permutations of the primer. |
+| unsolvable_primer_dimers.tsv | Only produced if there are primer dimers without replacements. Tells which primers form dimers and at which temperature.
 | varvamp_log.txt | Log file. |
 
 

diff --git a/docs/preparing_the_data.md b/docs/preparing_the_data.md
@@ -24,7 +24,7 @@ vsearch --cluster_fast <my_sequences.fasta> --clusters <output_dir> --id <float>
 
 ## You just want to test out varVAMP?
 
-No worries, we got you! Test it with an alignment for [Hepatitis E virus](../varvamp/example_data). These sequences were pre-selected with ```vsearch --id 0.83``` from available HepE sequence in GenBank and aligned with MAFFT! varVAMP should finish in seconds with the standard settings.
+No worries, we got you! Test it with an alignment for [Hepatitis E virus](../example_data). These sequences were pre-selected with ```vsearch --id 0.83``` from available HepE sequence in GenBank and aligned with MAFFT! varVAMP should finish in seconds with the standard settings.
 
 
 #### [Previous: Installation](./installation.md)&emsp;&emsp;[Next: Usage](./usage.md)
diff --git a/docs/usage.md b/docs/usage.md
@@ -6,7 +6,7 @@ In the most simple case varvamp will take an alignment and a output directory to
 **Minimal usage:**
 
 ```shell
-varvamp <alignment> <output>
+varvamp <alignment> <output dir>
 ```
 
 In this case varVAMP uses as standard settings:
@@ -21,7 +21,7 @@ These settings are quite relaxed and can produce decent results for diverse viru
 
 **Full usage:**
 ```shell
-varvamp <alignment> <output> [OPTIONS]
+varvamp <alignment> <output dir> [OPTIONS]
 ```
 ```
 positional arguments:
@@ -47,7 +47,7 @@ optional arguments:
 
 ## Further customization (advanced)
 
-Although, we believe that this will be in the most cases not necessary, you can customize all settings for varVAMP that are not specified via commands in the config.py. Here are also all default parameters stored if no optional arguments are given.
+Although, we believe that this will be in the most cases not necessary, you can customize all settings for varVAMP that are not specified via commands in the config.py. Here are also all default parameters stored if no optional arguments are given. [To fully customize varvamp install it directly from this github repository](./installation.md)
 
 Go to the configs location:
 ```shell
@@ -58,36 +58,38 @@ And open the config.py with an text editor, e.g.:
 gedit config.py
 ```
 Here you can adjust various settings including primer parameters and penalties.
-```python
 
+```python
 # basic primer parameters
 PRIMER_TMP = (57, 63, 60)  # temperatur (min, max, opt)
 PRIMER_GC_RANGE = (40, 60, 50)  # gc (min, max, opt)
-PRIMER_SIZES = (18, 27, 20)  # size (min, max, opt)
+PRIMER_SIZES = (17, 27, 20)  # size (min, max, opt)
+PRIMER_MAX_POLYX = 4  # max number of polyx repeats
+PRIMER_MAX_DINUC_REPEATS = 4  # max number of dinucleotide repeats
 PRIMER_HAIRPIN = 47  # max melting temp for secondary structures
-MAX_POLYX = 5  # max number of polyx
-MAX_DINUC_REPEATS = 2  # max number of dinucleotide repeats
-MAX_DIMER_TMP = 21  # max melting temp for dimers (homo- or heterodimers)
-MIN_3_WITHOUT_AMB = 2  # min len of 3' without ambiguous charaters
+PRIMER_MAX_GC_END = 3  # max GCs in the last 5 bases of the primer
+PRIMER_GC_CLAMP = 1  # min number of GC nucleotides at the very 3' end
+PRIMER_MIN_3_WITHOUT_AMB = 2  # min len of 3' without ambiguous charaters
+PRIMER_MAX_DIMER_TMP = 47  # max melting temp for dimers (homo- or heterodimers)
 
-# PCR parameters - adjust to your PCR
-MV_CONC = 50  # monovalent cations mM
-DV_CONC = 2  # divalent cations mM
-DNTP_CONC = 0.8  # dntp concentration mM
-DNA_CONC = 50  # primer concentration nM
+# PCR parameters
+PCR_MV_CONC = 50  # monovalent cations mM
+PCR_DV_CONC = 2  # divalent cations mM
+PCR_DNTP_CONC = 0.8  # dntp concentration mM
+PCR_DNA_CONC = 50  # primer concentration nM
 
 # multipliers for primer base penalties
 PRIMER_TM_PENALTY = 2  # temperature penalty
 PRIMER_GC_PENALTY = 0.2  # gc penalty
 PRIMER_SIZE_PENALTY = 0.5  # size penalty
-PRIMER_MAX_BASE_PENALTY = 8  # penalty for primer hardfiltering
+PRIMER_MAX_BASE_PENALTY = 8  # max base penalty for a primer
 PRIMER_3_PENALTY = (10, 10, 10)  # penalties for 3' mismatches
 PRIMER_PERMUTATION_PENALTY = 0.1  # penalty for the number of permutations
 ```
 To apply these new settings just repeat the installation procedure in the varVAMP dir:
 ```shell
 pip install .
 ```
-If you did everything right, varVAMPs config check passes otherwise it will produce an error. If that happens you can simply perform a git pull or adjust the settings that produced a warning.
+If you did everything right, varVAMP's config check passes. Otherwise it will produce an error. If that happens you can simply perform a git pull or adjust the settings that produced a warning.
 
 #### [Previous: Data preparation](./preparing_the_data.md)&emsp;&emsp;[Next: Output](./output.md)
diff --git a/docs/varvamp.png b/docs/varvamp.png
diff --git a/varvamp/example_data/hepatitis_e_aln.fasta → example_data/hepatitis_e_aln.fasta b/varvamp/example_data/hepatitis_e_aln.fasta → example_data/hepatitis_e_aln.fasta
diff --git a/setup.py b/setup.py
@@ -1,12 +1,19 @@
 from setuptools import setup, find_packages
 from varvamp import __version__, _program
 
+# read the contents of your README file
+from pathlib import Path
+this_directory = Path(__file__).parent
+long_description = (this_directory / "README.md").read_text()
+
 setup(
     name='varvamp',
+    long_description=long_description,
+    long_description_content_type='text/markdown',
     version=__version__,
     python_requires=">=3.9",
-    license_files = ('licence.txt'),
-    packages = find_packages(),
+    license_files=('licence.txt'),
+    packages=find_packages(),
     install_requires=[
         "biopython>=1.79",
         "matplotlib>=3.5.1",
@@ -25,7 +32,7 @@
     entry_points="""
     [console_scripts]
     {program} = varvamp.command:main
-    """.format(program = _program),
+    """.format(program=_program),
     include_package_data=True,
     keywords=[],
     zip_safe=False

diff --git a/varvamp/__init__.py b/varvamp/__init__.py
@@ -1,3 +1,3 @@
 """Tool to design amplicons for highly variable virusgenomes"""
 _program = "varvamp"
-__version__ = "0.2"
+__version__ = "0.3"