Skip to content

Commit

Permalink
Merge pull request #99 from inab/v1.5.0_release
Browse files Browse the repository at this point in the history
v1.5.0 release
  • Loading branch information
scapella authored Jul 2, 2024
2 parents b335b7d + f97d5ed commit 4cf89f5
Show file tree
Hide file tree
Showing 45 changed files with 1,068 additions and 95 deletions.
35 changes: 35 additions & 0 deletions .readthedocs.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Read the Docs configuration file for Sphinx projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.11"
# You can also specify other tool versions:
# nodejs: "20"
# rust: "1.70"
# golang: "1.20"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
configuration: docs/source/conf.py
# You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
# builder: "dirhtml"
# Fail on all warnings to avoid broken references
# fail_on_warning: true

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
# - pdf
# - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
install:
- requirements: docs/requirements.txt
20 changes: 12 additions & 8 deletions AUTHORS
Original file line number Diff line number Diff line change
@@ -1,14 +1,18 @@
** Authors **
* Authors*
Victor Fernández Rodríguez.
INB Hub. Life Sciences Department.
Barcelona Supercomputing Center (BSC). Barcelona, Spain.
e-mail: victor.fernandez _at_ bsc.es

Salvador Capella-Gutierrez.
Comparative Genomics Group. Bioinformatics and Genomics Department.
Centre for Genomic Regulation. Barcelona, Spain.
e-mail: scapella _at_ crg.es
Salvador Capella-Gutiérrez.
INB Hub. Life Sciences Department.
Barcelona Supercomputing Center (BSC). Barcelona, Spain.
e-mail: salvador.capella _at_ bsc.es

Toni Gabaldón.
Comparative Genomics Group. Bioinformatics and Genomics Department.
Centre for Genomic Regulation. Barcelona, Spain.
e-mail: tgabaldon _at_ crg.es
Comparative Genomics Group. Life Sciences Department.
Barcelona Supercomputing Center (BSC). Barcelona, Spain.
e-mail: toni.gabaldon _at_ bsc.es

** Authors (until trimAl v1.1) **

Expand Down
3 changes: 3 additions & 0 deletions docs/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sphinxcontrib-email
sphinx_copybutton
sphinx-rtd-theme
Binary file added docs/source/_static/automated1_tree.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/benchmarkings.pdf
Binary file not shown.
Binary file added docs/source/_static/gappyout_example.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/gappyout_plot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/overlap_example.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/strict_example.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/strict_plot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/strictplus_example.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/_static/supplementary_material.pdf
Binary file not shown.
Binary file added docs/source/_static/trimal.v1.2rev59.tar.gz
Binary file not shown.
5 changes: 5 additions & 0 deletions docs/source/_static/trimal_doc.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@


.wy-nav-content {
max-width: 1200px;
}
174 changes: 174 additions & 0 deletions docs/source/algorithms.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
Trimming algorithms
***********************

Manual methods
========================

Custom columns
------------------------
This algorithm eliminates a specified set of columns defined by the user. The set of columns to be removed should be provided as individual column numbers separated by commas, and/or as blocks of consecutive columns indicated by the first and last column numbers separated by a hyphen. In the following example::

-selectcols { n,l,m-k }

where n and l are interpreted single column numbers, while m-k is a range of columns (from column m to column k, both included) to be deleted. Note that column numbering starts from 0. For instance, the command::

-selectcols { 2,7,20-25,80-100 }

will remove columns 2 and 7, along with two blocks of columns ranging from column 20 to 25 and 80 to 100, respectively.


Threshold-based trimming
------------------------
The user can choose to remove all columns that do not meet a specified threshold or a combination of thresholds. The gap threshold (*-gt*) and similarity threshold (*-st*) represent the minimum values of the respective scores explained above and can be used individually or in combination. Like the scores they refer to, both thresholds range from 0 to 1.

trimAl provides two shortcuts to commonly used thresholds: *-nogaps* (equivalent to *-gt 1*), which deletes all columns with at least one gap, and *-noallgaps*, which removes columns composed solely of gaps.

In addition, the user can set a conservation threshold (*-cons*), indicating the minimum percentage of columns from the input alignment that should be retained in the trimmed alignment. This threshold is defined between 0 and 100 and takes precedence over all other thresholds. If any other threshold would result in a trimmed alignment with fewer columns than specified by the conservation threshold, trimAl adds more columns to meet the conservation threshold. These columns are added based on their scores, with a preference for columns with higher scores. In the case of equal scores, columns adjacent to already selected column-blocks and closer to the center of the alignment are added first,
prioritizing the extension of longer and central blocks.


When provided with a set of multiple sequence alignments, trimAl calculates a consistency score for each alignment in the set. Subsequently, the alignment with the highest score is selected. The chosen alignment can undergo various trimming methods, one of which involves removing columns that exhibit lower consistency across the other alignments. To achieve this, the user can utilize the *-ct* parameter to define the minimum values for the consistency score, within the range of 0 to 1. Any columns not meeting this specified value will be removed. Alternatively, the conservation score, as explained previously, can also be employed here. Moreover, it can be used in conjunction with gap and/or similarity methods.


Overlap trimming
------------------------
trimAl can also remove poorly aligned or incomplete sequences considering the rest of
sequences in the MSA. For that purpose, the user has to define two thresholds:
First, the residue overlap threshold (-resoverlap) corresponds to the minimum residue
overlap score for each residue.
Second, the sequence overlap threshold (-seqoverlap) sets up the minimum percentage of
the residues for each sequence that should pass the residue overlap threshold in order to
maintain the sequence in the new alignment. Sequences that do not pass the sequence
overlap threshold will be removed from the alignment. Finally, all columns that only have
gaps in the new alignment will also be removed from the final alignment.

trimAl can effectively eliminate poorly aligned or incomplete sequences while considering the entire multiple sequence alignment (MSA). To achieve this, users need to specify two thresholds:

1. **Residue Overlap Threshold (-resoverlap):** This threshold corresponds to the minimum residue overlap score required for each residue.

2. **Sequence Overlap Threshold (-seqoverlap):** This threshold establishes the minimum percentage of residues within each sequence that must surpass the residue overlap threshold to retain the sequence in the new alignment. Sequences failing to meet this criterion will be excluded from the final alignment.

Additionally, columns exclusively filled with gaps in the new alignment will be systematically removed.

.. _overlap-example-figure:
.. figure:: _static/overlap_example.png
:name: overlap-example
:width: 500px
:align: center
:alt: Overlap example MSA

An example of an alignment trimmed with the overlap method. We have used the same alignment as in :numref:`gappyout-example-figure`. Conserved (grey) and trimmed (white) columns are indicated.

Automated methods
========================

.. _gappyout_method:
Gappyout method
------------------------
This method relies on the gap distribution within the multiple sequence alignment (MSA). This method relies on the gap distribution within the Multiple Sequence Alignment (MSA). Initially, the method calculates gap scores for all columns and arranges them based on this score, generating a plot depicting potential gap score thresholds versus the percentage of the alignment below each threshold (see :numref:`gappyout-figure`). In the subsequent step, for every set of three consecutive points on this plot, trimAl computes the slopes between the first and third point, represented by blue lines. Following a comparison of all slopes, trimAl identifies the point with the maximal variation between consecutive slopes, indicated by a vertical red line in :numref:`gappyout-figure`.


.. _gappyout-figure:
.. figure:: _static/gappyout_plot.png
:name: gappyout-plot
:width: 500px
:align: center
:alt: Gappyout plot

Example of an internal trimAl plot showing possible gap score thresholds (y axis) versus percentages of alignment length below that threshold (x axis). Thin blue lines indicate slopes computed by the program. The vertical red line indicates the cut-off point
selected by the gappyout algorithm.


After determining a gap score cut-off point, trimAl removes all columns that do not meet this specified value (see :numref:`gappyout-example-figure`). In practical terms, this method effectively identifies the bimodal distribution of gap scores (columns rich in gaps and columns with fewer gaps) within an alignment. Subsequently, it eliminates the mode associated with a higher concentration of gaps. Our benchmarks indicate that this method efficiently eliminates a significant portion of poorly aligned regions.


.. _gappyout-example-figure:
.. figure:: _static/gappyout_example.png
:name: gappyout-example
:width: 500px
:align: center
:alt: Gappyout example MSA

An example of an alignment trimmed with the gappyout method. Conserved (grey) and trimmed (white) columns are indicated. This figure has been generated with trimAl -htmlout option.

.. _strict_method:
Strict method
------------------------

This method combines gappyout trimming with subsequent trimming based on an automatically selected similarity threshold. To determine the similarity threshold, trimAl utilizes the residue similarity scores distribution from the multiple sequence alignment (MSA). This distribution is transformed to a logarithmic scale (refer to :numref:`strict-figure`), and the residue similarity cutoff is selected as explained below.

.. _strict-figure:
.. figure:: _static/strict_plot.png
:name: strict-plot
:width: 500px
:align: center
:alt: Strict plot

trimAl's internal plot representing similarity score values versus the percentage of the alignment above that value. Vertical blue lines indicate the significant values at 20 and 80 percentiles. The cut-off point is indicated with a red vertical line.

From this similarity distribution, trimAl selects the values at percentiles 20 and 80 of the alignment length (vertical blue lines in :numref:`strict-figure`). The residue similarity threshold (vertical red line in :numref:`strict-figure`) is computed as follows:

.. math::
P_{20} = \log(\text{Simvalue}_{20})
P_{80} = \log(\text{Simvalue}_{80})
SimThreshold = \left(P_{80} + \frac{{P_{20} - P_{80}}}{10} \right)^{10}
This process is equivalent to establishing upper and lower boundaries for the threshold at percentiles 20 and 80, respectively, of the similarity score distribution in that alignment. The similarity threshold is calculated using the difference between these two boundaries, being at 1/10 to the lower boundary (similarity at P80).

This method of setting the similarity threshold has demonstrated optimal performance in our benchmarks. The lower and upper boundaries ensure that the 20% most conserved columns in the alignment are preserved, while the 20% most dissimilar columns are discarded.

The specific similarity threshold will lie between these boundaries depending on the distribution of similarity scores in the alignments. Alignments with steep similarity score curves and significant differences between the most similar and dissimilar columns will set more columns below the threshold. Conversely, alignments with more columns having scores similar to the most-conserved fraction will apply more relaxed cutoffs. However, the removal of a specific column will depend on its context.


Once trimAl has calculated the residue similarity cutoff, the following steps are taken:

1. The :ref:`gappyout method <gappyout_method>` is applied to identify columns that would be deleted with that method.
2. Residues below the similarity cutoff are marked.
3. After applying these filters, trimAl recovers (unmarks) columns that have not passed the gap and/or similarity thresholds but where three of the four most immediate neighboring columns (two on each side) have passed them.
4. Finally, in a last step, trimAl removes all columns that do not fall within a block of at least five consecutive columns unmarked for deletion.


.. _strict-example-figure:
.. figure:: _static/strict_example.png
:name: strict-example
:width: 500px
:align: center
:alt: Strict example MSA

An example of an alignment trimmed with the strict method. We have used the same alignment as in :numref:`gappyout-example-figure`. Conserved (grey) and trimmed (white) columns are indicated.


Strictplus method
------------------------
This approach is very similar to the strict method. However, the final step of the algorithm is slightly different. In this case, the block size is defined as 1% of the alignment size with a minimum value of 3 and a maximum size of 12.

This method is optimized for neighbor joining phylogenetic tree reconstruction.

.. _strictplus-example-figure:
.. figure:: _static/strictplus_example.png
:name: strictplus-example
:width: 500px
:align: center
:alt: Strictplus example MSA

An example of an alignment trimmed with strictplus method. In this case, the block size has automatically been set to 12 because the alignment length is greater than 1200 residues. Again, the same alignment as the previous figures :numref:`gappyout-example-figure` and :numref:`strict-example-figure` has been used.


Automated1 method
------------------------
Based on our own benchmarks with simulated alignments (see :doc:`benchmarking <benchmarking>`) we have designed a heuristic approach, denoted as automated1, to determine the optimal automatic method for trimming a given alignment. This heuristic is specifically fine-tuned for trimming alignments intended for maximum likelihood phylogenetic analyses.

Making use of a decision tree (:numref:`gappyout-example-figure`) , this heuristic dynamically selects between the :ref:`gappyout <gappyout_method>` and :ref:`strict <strict_method>` methods. In making this choice, trimAl considers the average identity score among all the sequences in the alignment, the average identity score for each most similar pair of sequences in the alignment, as well as the number of sequences in the alignment. We have observed that all these variables were important in deciding which method would provide the highest improvement on a given alignment.

.. _automated1-figure:
.. figure:: _static/automated1_tree.png
:name: automated1
:width: 500px
:align: center
:alt: Automated1 decision tree

A decision tree for the heuristic method automated1. trimAl uses strict (light blue) or gappyout (light grey) methods depending on 1) the average identity score (Avg. identity score) among the sequences in the alignment, 2) the number of sequences in the alignment and 3) the average identity score (max Identity Score) computed from the maximum identity score for each sequence in the alignment. We use light yellow color to highlight the decisions in the tree.

40 changes: 40 additions & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'trimAl'
copyright = '2024, trimAl team'
author = 'trimAl team'
release = '1.5.0'
version = release

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [ 'sphinxcontrib.email', 'sphinx_copybutton']

templates_path = ['_templates']
exclude_patterns = []



# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

#html_theme = 'alabaster'
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']

html_title = 'trimAl'

html_css_files = ["https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css", "trimal_doc.css"]

copybutton_prompt_text = "$ "
html_extra_path=['googlef52b8b3434a2466e.html']

numfig = True
language = 'en'
1 change: 1 addition & 0 deletions docs/source/googlef52b8b3434a2466e.html
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
google-site-verification: googlef52b8b3434a2466e.html
69 changes: 69 additions & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@

.. raw:: html

<a href="https://github.com/inab/trimal">
<i class="fa fa-github" style="text-decoration: none;"></i>
</a>

.. toctree::
:hidden:
:maxdepth: 3

installation
usage
scores
algorithms
benchmarking


Welcome to trimAl's documentation!

trimAl
==================
This is trimAl's information page. You can also find information related to readAl, a MSA format conversor.
trimAl is a tool for the automated removal of spurious sequences or poorly aligned regions from a multiple sequence alignment.

trimAl can consider several parameters, alone or in multiple combinations, in order to select the most-reliable positions in the alignment.
These include the proportion of sequences with a gap, the level of residue similarity and, if several alignments for the same set of sequences are provided, the consistency level of columns among alignments.
Moreover, trimAl allows to manually select a set of columns and sequences to be removed from the alignment.

trimAl implements a series of automated algorithms that trim the alignment searching for optimum thresholds based on inherent characteristics of the input alignment, to be used so that the signal-to-noise ratio after alignment trimming phase is increased.

Among trimAl's additional features, trimAl allows getting the complementary alignment (columns that were trimmed), to compute statistics from the alignment, to select the output file format , to get a summary of trimAl's trimming in HTML and SVG formats, and many other options.



Publications
============
- `trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses <https://doi.org/10.1093/bioinformatics/btp348>`_
- `trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses (pdf) <https://academic.oup.com/bioinformatics/article-pdf/25/15/1972/48994574/bioinformatics_25_15_1972.pdf>`_
- `Supplementary material (pdf) <_static/supplementary_material.pdf>`_

Citing this tool
==================
::

Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics (Oxford, England), 25(15), 1972–1973. https://doi.org/10.1093/bioinformatics/btp348


License
========
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, the last available version.

Contact
========
If you have any doubt or problem feel free to open a `Github issue <https://github.com/inab/trimal/issues>`_ in the repository.

- :email:`Nicolás Díaz Roussel <[email protected]>`
- :email:`Salvador Capella Gutierrez <[email protected]>`


Development team
================
- Salvador Capella Gutiérrez
- Toni Gabaldón
- Víctor Fernández Rodríguez
- Jose M. Silla Martínez (former)

Loading

0 comments on commit 4cf89f5

Please sign in to comment.