Skip to content

Commit

Permalink
Merge pull request #4 from jonas-fuchs/dev
Browse files Browse the repository at this point in the history
Reworked primer search and introduced primer dimer sensitive amplicon tiling
  • Loading branch information
jonas-fuchs authored Mar 25, 2023
2 parents 8018f7c + 4747dc2 commit bf122eb
Show file tree
Hide file tree
Showing 18 changed files with 1,156 additions and 1,007 deletions.
15 changes: 7 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,22 +1,20 @@
**var**iable **V**irus**AMP**licons (varVAMP) is a tool to design primers for highly diverse viruses. The input is an alignment of your viral (full-genome) sequences.

# varVAMP

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

**var**iable **V**irus**AMP**licons is a tool to design primers for highly diverse viruses. The input is an alignment of your viral (full-genome) sequences.
For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority.

**varVAMP comes in three different flavors:**

<img src="./docs/varvamp.png" alt="varVAMP logo" />

**TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing.

**SANGER** *(coming soon)*: varVAMP searches for the very best primers and reports back an amplicon which can be used for PCR-based screening approaches.

**QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers.
**TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing.

Features that are planned, but have not been implemented yet:
* Dimer search for amplicon schemes (currently only primer homo-dimers and hetero-dimers within the amplicon are evaluated).
* Primer BLAST.
**QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers.

This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate!

Expand All @@ -27,11 +25,12 @@ This program is currently being developed and in an alpha state. You are welcome
* [Usage](docs/usage.md)
* [Output](docs/output.md)
* [How it works](docs/how_varvamp_works.md)
* [FAQ](docs/FAQ.md)

---

**Important disclaimer:**
*For the primer design, varVAMP uses [primer3](https://pypi.org/project/primer3-py/) to check if digested kmers of a sequence are potential primers. The functions for this were adapted from [primalscheme](www.github.com/aresti/primalscheme) and I do not claim credit.*
*For the primer design, varVAMP uses [primer3](https://pypi.org/project/primer3-py/) to check if digested kmers of a sequence are potential primers. Some of the functions for this were adapted from [primalscheme](www.github.com/aresti/primalscheme) and I do not claim credit.*

*The remaing code is under the GPLv3 licence. The code is WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.*
23 changes: 23 additions & 0 deletions docs/FAQ.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
## FAQ

1. **How do I set settings for varVAMP?**

Start with setting your optimal length of your amplicon, the max length that you can tolerate and the overlap that you want to achieve. If you want to allow ambiguous characters, set them to 2 as a start. Set the threshold to the mean sequence identity of your alignment. Run varvamp and then optimize the output.

2. **How do I optimize the output?**

It all depends on how many conserved regions varVAMP is able to find! There are two main parameters that influence this. The number of ambiguous bases allowed within a primer and the threshold for consensus nucleotides. Setting the threshold higher or the number of ambiguous bases lower will result in less conserved regions. If you have set the parameters below and get a decent output, increase the threshold until the output gets worse. This will increase the specificity of your primers. Likewise, if you do not have a good output, consider increasing the number of ambiguous bases before you lower the threshold. The console output varVAMP will also give you some suggestions.

3. **varVAMP reported primer dimers. What now?**

In your case varVAMP could not find suitable replacement primers. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification.


4. **How fast is varVAMP?**

varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER are faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned.




#### [Previous: How it works](./how_varvamp_works.md)
30 changes: 13 additions & 17 deletions docs/how_varvamp_works.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,26 +19,22 @@ varVAMP creates consensus sequence on the basis of the threshold. If a nucleotid
varVAMP searches for conserved regions as defined by a user-defined number of ambiguous bases that is allowed within the minimal length of a primer. The algorithm opens windows over the ambiguous consensus sequence and determines if a window satisfies these constraints.

### Primer search
varVAMP uses primer3-py to search for potential primers. The evaluation if primers match certain criteria was adapted from [primalscheme](www.github.com/aresti/primalscheme). The primer search contains multiple steps:
1. Digest the conserved regions in kmers with the min and max length of primers.
2. Evaluate if these kmers are potential primers (temperature etc) and score their base penalty.
3. Hardfilter kmers that do not satisfy the max base penalty constraint and number of allowed ambiguous characters at the 3'end.
4. Filter primers with the same start for the primer with the lowest score.
5. Score the 3' penalty and permutation penalty.
6. Fin lowest scoring primer in a window. This window is defined by the end of the first primer in the window + the max primer length. This greatly reduces the complexity while retaining well scoring primers.
varVAMP uses primer3-py to search for potential primers. Some of the evaluation if primers match certain criteria was adapted from [primalscheme](www.github.com/aresti/primalscheme). The primer search contains multiple steps:
1. Digest the conserved regions into kmers with the min and max length of primers. This is performed on a consensus sequence that does not contain ambiguous characters but is just the majority consensus of the alignment. Therefore, primer parameters will be later calculated for the best fitting primer.
2. Evaluate if these kmers are potential primers direction independent (temp, gc, size, poly-x repeats and poly dinucleotide repeats) and direction dependent (secondary structure, gc clamp, number of GCs in the last 5 bases of the 3'end and min 3' nucleotides without a ambiguous base). Filter for kmers that satisfy all constraints and calculate their penalties (explained in the last section).
3. Find lowest scoring primer. varVAMP sorts the primers by their score and always takes the best scoring if the primer positions have not been covered by a better primer. This greatly reduces the complexity of the later amplicon search while only retaining the best scoring primer of a set of overlapping primers.

### Amplicon search

#### Amplicon-tiling
To search for the best scoring amplicon, varVAMP uses a graph based approach.
1. Create all possible amplicons with the given length constraints.
2. Create a graph for amplicons that satisfy the overlap constraints. The cost to go to the next node is determined by the amplicon score.
3. Use the dijkstra algorithm to find the shortest paths from a given start node.
1. Create all possible amplicons with the given length constraints and ensure that primer pairs are not forming dimers.
2. Create a graph containing all amplicons and their potential neighboring amplicons. To design a good scheme, the next primer has to lie within the second half of the current primer and satisfy the overlap constraint. The cost to go to a neighboring amplicon is determined by the amplicon score.
3. Use the dijkstra algorithm to find the path with the lowest costs from a given start node.
4. Determine potential stop nodes as those amplicons that have the furthest stop in the alignment.
5. Determine shortest paths between the start and all potential stop nodes. Get the lowest scoring.
6. Repeat this process for each start node until the best coverage is reached.

This then is the best scoring amplicon scheme!
6. Repeat steps 3-5 for each start node until the best coverage is reached. This then is the best scoring amplicon scheme!
7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers.

#### Sanger sequencing
coming soon
Expand All @@ -51,18 +47,18 @@ coming soon
```python
PRIMER_MAX_BASE_PENALTY
```
Each primer is scored for its deviation from the optimums. Base penealty is the cummulative penalty of penalties for temperature, gc and size. Primer base penalties higher than the max base penalty are hardfiltered.
Each primer is scored for its deviation from the optimal size, gc content and temperature. Base penalty is the sum of these deviations. Primer base penalties higher than the max base penalty are excluded.

```python
PRIMER_3_PENALTY
```
Each position in the primer is scored for mismatches in all sequences. If a 3' penalty is given the first in the tuple is multiplied with the freq mismatch at the very 3' end. The next is multiplied with the -1 freq and so on. Increase penalty if you want to shift amplicons torwards best 3' matching. set to 0 if you do not care about 3' mismatches.
Each position in the primer is scored for mismatches in all sequences. If a 3' penalty is given the first in the tuple is multiplied with the freq mismatch at the very 3' end. The next is multiplied with the -1 freq and so on. Increase penalty if you want to shift amplicons towards best 3' matching. set to 0 if you do not care about 3' mismatches.

```python3
PRIMER_PERMUTATION_PENALTY
```
The number permutations of a primer is multiplied by the penalty. For example 24 permutations and a penalty of 0.1 will yield a penalty of 2.4. Set to 0 if you do not care about the number of permutations.

In the end all scores of a primer are summed up and yield a final score. The score for each amplicon is then the score of its LEFT + RIGHT primers multiplied by the fold increase of the amplicon length comapred to the optional length. This insures that in the final scheme not only large amplicons are used.
All scores of a primer are summed up and yield a final score. The score for each amplicon is then the score of its LEFT + RIGHT primers multiplied by the fold increase of the amplicon length compared to the optional length. This insures that in the final scheme not only large amplicons are used.

#### [Previous: Output](./output.md)
#### [Previous: Output](./output.md)&emsp;&emsp;[Next: FAQ](./FAQ.md)
25 changes: 14 additions & 11 deletions docs/installation.md
Original file line number Diff line number Diff line change
@@ -1,34 +1,37 @@
## Requirements
varVAMP runs on UNIX Systems, MacOSX and Windows with Python3 >=3.9 installed.


## Installation of varvamp with pip
## Installation

```shell
git clone https://github.com/jonas-fuchs/varVAMP
cd varVAMP
pip install .
pip install varvamp
```
That was already it. To check if it worked:
```shell
varvamp -v
```
You should see the current varVAMP version.

## Installation via requirements.txt for development
## Installation for advanced customization or development

All varVAMP options (such as temperature, size, penalties) can be customized in the config.py. However, to do this you will have to install varvamp not from the PyPI repository, but directly from this github repository.

### - via pip (recommended)

```shell
git clone https://github.com/jonas-fuchs/varVAMP
cd varVAMP
pip install -r requirements.txt
python3 varvamp -v
pip install .
```

## Update
### - via requirements.txt

```shell
git clone https://github.com/jonas-fuchs/varVAMP
cd varVAMP
git pull
pip install .
pip install -r requirements.txt
python3 varvamp -v
```


#### [Next: Preparing your data](./preparing_the_data.md)
3 changes: 2 additions & 1 deletion docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@ varVAMP produces multiple main output files:
| per_base_mismatches.pdf | Barplot of the percent mismatches at each nucleotide position of the primer. |
| primer_to_amplicon_assignments.tabular | Simple tab separated file, which primers belong together. Useful for bioinformatic workflows that include primer trimming |
| primers.bed | A bed file with the primer locations. Includes the primer score. The lower, the better. |
| primer.tsv | A tab separated file with important parameters for the primers including the sequence. |
| primer.tsv | A tab separated file with important parameters for the primers including the sequence with ambiguous nucleotides (already in the right strand) and the gc and temperature of the best fitting primer as well as for the mean for all permutations of the primer. |
| unsolvable_primer_dimers.tsv | Only produced if there are primer dimers without replacements. Tells which primers form dimers and at which temperature.
| varvamp_log.txt | Log file. |


Expand Down
2 changes: 1 addition & 1 deletion docs/preparing_the_data.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ vsearch --cluster_fast <my_sequences.fasta> --clusters <output_dir> --id <float>

## You just want to test out varVAMP?

No worries, we got you! Test it with an alignment for [Hepatitis E virus](../varvamp/example_data). These sequences were pre-selected with ```vsearch --id 0.83``` from available HepE sequence in GenBank and aligned with MAFFT! varVAMP should finish in seconds with the standard settings.
No worries, we got you! Test it with an alignment for [Hepatitis E virus](../example_data). These sequences were pre-selected with ```vsearch --id 0.83``` from available HepE sequence in GenBank and aligned with MAFFT! varVAMP should finish in seconds with the standard settings.


#### [Previous: Installation](./installation.md)&emsp;&emsp;[Next: Usage](./usage.md)
34 changes: 18 additions & 16 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ In the most simple case varvamp will take an alignment and a output directory to
**Minimal usage:**

```shell
varvamp <alignment> <output>
varvamp <alignment> <output dir>
```

In this case varVAMP uses as standard settings:
Expand All @@ -21,7 +21,7 @@ These settings are quite relaxed and can produce decent results for diverse viru

**Full usage:**
```shell
varvamp <alignment> <output> [OPTIONS]
varvamp <alignment> <output dir> [OPTIONS]
```
```
positional arguments:
Expand All @@ -47,7 +47,7 @@ optional arguments:

## Further customization (advanced)

Although, we believe that this will be in the most cases not necessary, you can customize all settings for varVAMP that are not specified via commands in the config.py. Here are also all default parameters stored if no optional arguments are given.
Although, we believe that this will be in the most cases not necessary, you can customize all settings for varVAMP that are not specified via commands in the config.py. Here are also all default parameters stored if no optional arguments are given. [To fully customize varvamp install it directly from this github repository](./installation.md)

Go to the configs location:
```shell
Expand All @@ -58,36 +58,38 @@ And open the config.py with an text editor, e.g.:
gedit config.py
```
Here you can adjust various settings including primer parameters and penalties.
```python

```python
# basic primer parameters
PRIMER_TMP = (57, 63, 60) # temperatur (min, max, opt)
PRIMER_GC_RANGE = (40, 60, 50) # gc (min, max, opt)
PRIMER_SIZES = (18, 27, 20) # size (min, max, opt)
PRIMER_SIZES = (17, 27, 20) # size (min, max, opt)
PRIMER_MAX_POLYX = 4 # max number of polyx repeats
PRIMER_MAX_DINUC_REPEATS = 4 # max number of dinucleotide repeats
PRIMER_HAIRPIN = 47 # max melting temp for secondary structures
MAX_POLYX = 5 # max number of polyx
MAX_DINUC_REPEATS = 2 # max number of dinucleotide repeats
MAX_DIMER_TMP = 21 # max melting temp for dimers (homo- or heterodimers)
MIN_3_WITHOUT_AMB = 2 # min len of 3' without ambiguous charaters
PRIMER_MAX_GC_END = 3 # max GCs in the last 5 bases of the primer
PRIMER_GC_CLAMP = 1 # min number of GC nucleotides at the very 3' end
PRIMER_MIN_3_WITHOUT_AMB = 2 # min len of 3' without ambiguous charaters
PRIMER_MAX_DIMER_TMP = 47 # max melting temp for dimers (homo- or heterodimers)

# PCR parameters - adjust to your PCR
MV_CONC = 50 # monovalent cations mM
DV_CONC = 2 # divalent cations mM
DNTP_CONC = 0.8 # dntp concentration mM
DNA_CONC = 50 # primer concentration nM
# PCR parameters
PCR_MV_CONC = 50 # monovalent cations mM
PCR_DV_CONC = 2 # divalent cations mM
PCR_DNTP_CONC = 0.8 # dntp concentration mM
PCR_DNA_CONC = 50 # primer concentration nM

# multipliers for primer base penalties
PRIMER_TM_PENALTY = 2 # temperature penalty
PRIMER_GC_PENALTY = 0.2 # gc penalty
PRIMER_SIZE_PENALTY = 0.5 # size penalty
PRIMER_MAX_BASE_PENALTY = 8 # penalty for primer hardfiltering
PRIMER_MAX_BASE_PENALTY = 8 # max base penalty for a primer
PRIMER_3_PENALTY = (10, 10, 10) # penalties for 3' mismatches
PRIMER_PERMUTATION_PENALTY = 0.1 # penalty for the number of permutations
```
To apply these new settings just repeat the installation procedure in the varVAMP dir:
```shell
pip install .
```
If you did everything right, varVAMPs config check passes otherwise it will produce an error. If that happens you can simply perform a git pull or adjust the settings that produced a warning.
If you did everything right, varVAMP's config check passes. Otherwise it will produce an error. If that happens you can simply perform a git pull or adjust the settings that produced a warning.

#### [Previous: Data preparation](./preparing_the_data.md)&emsp;&emsp;[Next: Output](./output.md)
Binary file modified docs/varvamp.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
File renamed without changes.
13 changes: 10 additions & 3 deletions setup.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,19 @@
from setuptools import setup, find_packages
from varvamp import __version__, _program

# read the contents of your README file
from pathlib import Path
this_directory = Path(__file__).parent
long_description = (this_directory / "README.md").read_text()

setup(
name='varvamp',
long_description=long_description,
long_description_content_type='text/markdown',
version=__version__,
python_requires=">=3.9",
license_files = ('licence.txt'),
packages = find_packages(),
license_files=('licence.txt'),
packages=find_packages(),
install_requires=[
"biopython>=1.79",
"matplotlib>=3.5.1",
Expand All @@ -25,7 +32,7 @@
entry_points="""
[console_scripts]
{program} = varvamp.command:main
""".format(program = _program),
""".format(program=_program),
include_package_data=True,
keywords=[],
zip_safe=False
Expand Down
2 changes: 1 addition & 1 deletion varvamp/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
"""Tool to design amplicons for highly variable virusgenomes"""
_program = "varvamp"
__version__ = "0.2"
__version__ = "0.3"
Loading

0 comments on commit bf122eb

Please sign in to comment.