Github minutia #4

Open
wants to merge 3 commits into
base: master
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
# Ignore automatically created temporary files and directories
CMakeCache.txt
CMakeFiles/
bin/
Makefile
cmake_install.cmake
13 changes: 13 additions & 0 deletions .travis.yml
@@ -0,0 +1,13 @@
language: cpp
compiler: gcc
addons:
  apt:
    sources:
      - george-edison55-precise-backports # for cmake 3.2.3
      - ubuntu-toolchain-r-test # for g++-5
    packages:
      - cmake
      - cmake-data
      - g++-5

script: cmake -DCMAKE_CXX_COMPILER=g++-5 CMakeCache.txt && make && ./test_SatsumaSynteny2 2>&1 > /dev/null
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 3.3)
cmake_minimum_required(VERSION 3.2)
project(satsuma2)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -std=c++14 -O3 -w")
214 changes: 166 additions & 48 deletions README.md
@@ -1,8 +1,15 @@
[![Build Status](https://travis-ci.org/arendsee/satsuma2.svg?branch=master)](https://travis-ci.org/arendsee/satsuma2)


# Satsuma2

Satsuma2 is an optimsed version of Satsuma, a tool to reliably align large and complex DNA sequences providing maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accomodate the billions of base pairs in vertebrate genomes). Satsuma2 adresses these issues through three novel strategies:
Satsuma2 is an optimised version of Satsuma, a tool to reliably align large and
complex DNA sequences providing maximum sensitivity (to find all there is to
find), specificity (to only find real homology) and speed (to accommodate the
billions of base pairs in vertebrate genomes). Satsuma2 addresses these issues
through three novel strategies:

* cross-correlation, implemented via fast Fourier transformation.
* cross-correlation, implemented via fast Fourier transformation.
* a match scoring scheme that eliminates almost all false hits.
* an asynchronous "battleship"-like search that enables fast whole-genome alignment.

Expand All @@ -11,22 +18,36 @@ Satsuma2 also interfaces with MizBee, a multi-scale synteny browser for explorin
Satsuma2 is implemented in C++ on Linux.

## Licensing
Satsuma2 is free software: you can redistribute it and/or modify it under the terms of the Lesser GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or(at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License for more details.

Satsuma2 is free software: you can redistribute it and/or modify it under the
terms of the Lesser GNU General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your option) any
later version. This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License
for more details.

## Citing Satsuma2

We plan to submit an application note that should be published during the summer of 2016. In the meantime, if you are using Satsuma2 for research that will be published before that, please contact us to discuss how you can cite the tool.
We plan to submit an application note that should be published during the
summer of 2016. In the meantime, if you are using Satsuma2 for research that
will be published before that, please contact us to discuss how you can cite
the tool.

## Installation

Download the source code from <https://github.com/bjclavijo/satsuma2.git> and compile it using CMake v3.3+. To run, Satsuma2 requires GCC v5.2+. The binaries are generated in the bin/ directory.
Download the source code from <https://github.com/bjclavijo/satsuma2.git> and
compile it using CMake v3.3+. To run, Satsuma2 requires GCC v5.2+. The
binaries are generated in the bin/ directory.

NOTE: if you encounter the error "... undefined reference to `pthread_create'"
during compilation, add the flag `-pthread` to CMakeLists.txt, i.e. change:

NOTE: if you encounter the error "... undefined reference to `pthread_create'" during compilation, add the flag -pthread to CMakeLists.txt, i.e. change:
`set(CMAKE\_CXX\_FLAGS "${CMAKE\_CXX\_FLAGS} -lpthread -std=c++14 -O3 -w")`

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -std=c++14 -O3 -w")
to:
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -pthread -std=c++14 -O3 -w")

`set(CMAKE\_CXX\_FLAGS "${CMAKE\_CXX\_FLAGS} -lpthread -pthread -std=c++14 -O3 -w")`
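
The one-line change described in the note above can also be applied from the shell. This is a minimal sketch, assuming the flags line in CMakeLists.txt matches the text quoted above exactly; `add_pthread_flag` is a hypothetical helper name and is not part of Satsuma2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Satsuma2): print a copy of the given
# CMakeLists.txt with -pthread inserted after -lpthread in the compiler flags.
# Assumes the flags appear exactly as "-lpthread -std=c++14".
add_pthread_flag() {
    sed 's/-lpthread -std=c++14/-lpthread -pthread -std=c++14/' "$1"
}
```

One way to use it is `add_pthread_flag CMakeLists.txt > CMakeLists.txt.new && mv CMakeLists.txt.new CMakeLists.txt`, which writes to a temporary file first so the original is only replaced if `sed` succeeds.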

## Quick start

@@ -40,7 +61,9 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -pthread -std=c++14 -O3 -w")

## Running Satsuma2

As Satsuma2 calls other executables (HomologyByXCorr, MergeXCorrMatches etc.), you need to set an environment variable to tell the software where to find the binaries;
As Satsuma2 calls other executables (HomologyByXCorr, MergeXCorrMatches etc.),
you need to set an environment variable to tell the software where to find the
binaries;

```
export SATSUMA2_PATH=/path/to/binaries
@@ -82,14 +105,27 @@ Available arguments:
-dump_cycle_matches<bool> : dump matches on each cycle (for debug/testing) (def=0)
```
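
Because SatsumaSynteny2 locates its helper executables through `SATSUMA2_PATH`, it can be useful to check that variable before starting a long run. The sketch below is one possible pre-flight check; `check_satsuma_path` is a hypothetical helper, and the tool names passed to it are only the ones mentioned in this README.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of Satsuma2): verify that
# SATSUMA2_PATH is set and that each named tool is an executable file in it.
check_satsuma_path() {
    local dir="${SATSUMA2_PATH:-}" tool missing=0
    if [ -z "$dir" ]; then
        echo "SATSUMA2_PATH is not set" >&2
        return 1
    fi
    for tool in "$@"; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing or not executable: $dir/$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_satsuma_path HomologyByXCorr MergeXCorrMatches` returns a non-zero status and reports each tool it could not find.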

Required parameters are query FASTA (-q), target FASTA (-t) and output directory (-o). The query and target sequences are chunked (based on the -t\_chunk and -q\_chunk parameters) and KMatch is used to detect aligning regions between chunks. The number of chunks generated depends on the length of your query and target sequences. The amount of memory reserved for KMatch can be modified using the -km\_mem parameter which defaults to 100Gb.

SatsumaSynteny2 despatches slave processes to compare the chunks which run asynchronously. The number of slaves, threads per slave and memory limit per slave are specified using the -slaves, -threads and -sl\_mem parameters. The default is one single-threaded slave using 100Gb of memory. Slaves can be run on a single machine or submitted via a job submission system such as LSF, PBS or SLURM. The satsuma\_run.sh file is used by SatsumaSynteny2 to start the slaves. Before running SatsumaSynteny2, you need to configure this file to suit your environment by commenting out the lines you don't need with #. For example, to run on SLURM your file should look like this;
Required parameters are query FASTA (-q), target FASTA (-t) and output
directory (-o). The query and target sequences are chunked (based on the
-t\_chunk and -q\_chunk parameters) and KMatch is used to detect aligning
regions between chunks. The number of chunks generated depends on the length
of your query and target sequences. The amount of memory reserved for KMatch
can be modified using the -km\_mem parameter which defaults to 100Gb.

SatsumaSynteny2 despatches slave processes to compare the chunks which run
asynchronously. The number of slaves, threads per slave and memory limit per
slave are specified using the -slaves, -threads and -sl\_mem parameters. The
default is one single-threaded slave using 100Gb of memory. Slaves can be run
on a single machine or submitted via a job submission system such as LSF, PBS
or SLURM. The satsuma\_run.sh file is used by SatsumaSynteny2 to start the
slaves. Before running SatsumaSynteny2, you need to configure this file to
suit your environment by commenting out the lines you don't need with #. For
example, to run on SLURM your file should look like this;

```
# Script for starting Satsuma jobs on different job submission environments
# Comment out the lines not required
# Usage: satsuma_run.sh <current_path> <kmatch_cmd> <ncpus> <mem> <job_id> <run_synchronously>
# Usage: satsuma\_run.sh <current_path> <kmatch_cmd> <ncpus> <mem> <job_id> <run_synchronously>
# mem should be in Gb, ie. 100Gb = 100

# no submission system, run process locally either synchronously or asynchronously
@@ -113,50 +149,115 @@ sbatch -p tgac-long -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh

```

### Notes
### Notes

* If SatsumaSynteny2 is run without a submission system, KMatch jobs will be
launched synchronously in order to keep memory requirements low. If you have
plenty of memory available you can opt to run the KMatch jobs asynchronously
(-km_sync 0). KMatch requires a lot of memory and multiple KMatch processes
running at the same time may cause SatsumaSynteny2 to abort if not enough
memory is available.

* The parameters -km\_mem and -sl\_mem are only applied when using a job
submission system. We strongly recommend using a job submission system to
run SatsumaSynteny2 which allows more control of the resource requirements of
this software.

* If the output directory is not empty, SatsumaSynteny2 will not overwrite any
files but exit with an error message.

* Idling processes self-terminate after two minutes. The overall alignments
will still complete, but using fewer processes.

* If alignment runs locally but not on the server farm, check whether processes
on the farm can communicate via TCP/IP.

* Currently, the entire sequences are loaded into RAM by each process. When
comparing large genomes, we strongly recommend making sure that enough RAM is
available to each process (roughly the size of both genomes in bytes).
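
The RAM guideline in the last note can be approximated from the input files. This is a rough sketch, assuming that the on-disk size of the two FASTA files is an acceptable proxy for the sequence size (headers and newlines make it a slight overestimate); `combined_fasta_bytes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the combined size in bytes of the given files,
# as a rough estimate of the RAM each process needs to hold both genomes.
combined_fasta_bytes() {
    local total=0 size f
    for f in "$@"; do
        size=$(wc -c < "$f") || return 1   # fail if a file cannot be read
        total=$((total + size))
    done
    echo "$total"
}
```

Running `combined_fasta_bytes query.fa target.fa` (placeholder file names) prints a single number that can be compared against the memory available per process.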


### Parameter choice, execution and data preparation

* The default parameters should work well for most genomes.

* SatsumaSynteny2 runs most efficiently on either multi-processor machines or
on clusters that are tightly coupled (fast access to files shared by the
control process and the slaves).

* Especially for larger genomes, we recommend leaving one CPU dedicated to the
control process SatsumaSynteny2.

* If SatsumaSynteny2 is run without a submission system, KMatch jobs will be launched synchronously in order to keep memory requirements low. If you have plenty of memory available you can opt to run the KMatch jobs asynchronously (-km_sync 0). KMatch requires a lot of memory and multiple KMatch processes running at the same time may cause SatsumaSynteny2 to abort if not enough memory is available.
* The parameters -km\_mem and -sl\_mem are only applied when using a job submission system. We strongly recommend using a job submission system to run SatsumaSynteny2 which allows more control of the resource requirements of this software.
* If the output directory is not empty, SatsumaSynteny2 will not overwrite any files but exit with an error message. * Idling processes self-terminate after two minutes. The overall alignments will still complete, but using fewer processes. * If alignment runs locally but not on the server farm, check whether processes on the farm can communicate via TCP/IP. * Currently, the entire sequences are loaded into RAM by each process. For comparison of large genomes, we strongly recommend to make sure that the CPUs have enough RAM available (~ the size of both genomes in bytes).
* For larger genomes (>1Gb), we recommend using one chromosome of one genome as
the target sequence and the entire other genome as the query sequence, and
process alignments one query chromosome at a time.

### Parameter choice, execution and data preparation
* The default parameters should work well for most genomes.* SatsumaSynteny2 runs most efficiently on either multi-processor machines or on clusters that are tightly coupled (fast access to files shared by the control process and the slaves)* Especially for larger genomes, we recommend leaving one CPU dedicated to the control process SatsumaSynteny2.* For larger genomes (>1Gb), we recommend using one chromosome of one genome as the target sequence and the entire other genome as the query sequence, and process alignments one query chromosome at a time. * To include large-scale duplications in the query sequence (in addition to the target sequence), use the option –dups.* If using the option –nofilter, the number of initial searches (-ni) should be higher than the number of processes (-n) to ensure that subsequent processes have sufficient seeds. Note that initial searches will be queued to a number of processes specified by -n.
* When many processes search a tight space, the number of pixels per CPU (-m) should be small (e.g. ‘–m 1’ as in the sample script/data set) to avoid unbalanced load (i.e. some processes get all the pixels while others are starved, since they overlap). However, a small value for –m increases inter-process communication, which should be a consideration when deploying hundreds of processes.
* To include large-scale duplications in the query sequence (in addition to the
target sequence), use the option -dups.

* If using the option -nofilter, the number of initial searches (-ni) should be
higher than the number of processes (-n) to ensure that subsequent processes
have sufficient seeds. Note that initial searches will be queued to a number
of processes specified by -n.

* When many processes search a tight space, the number of pixels per CPU (-m)
should be small (e.g. '-m 1' as in the sample script/data set) to avoid
unbalanced load (i.e. some processes get all the pixels while others are
starved, since they overlap). However, a small value for -m increases
inter-process communication, which should be a consideration when deploying
hundreds of processes.
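
The recommendation above to work with one chromosome at a time requires splitting a multi-sequence FASTA file first. The following is a minimal sketch of one way to do that with awk; `split_fasta` is a hypothetical helper, not a Satsuma2 tool, and it assumes each record starts with a `>` header line whose first word is the sequence name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Satsuma2): write each sequence of a
# multi-FASTA file ($1) to its own file in the output directory ($2).
# File names are taken from the first word of each header, with characters
# other than letters, digits, '.', '_' and '-' replaced by '_'.
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ {
            if (out) close(out)          # finish the previous sequence file
            name = substr($1, 2)         # drop the leading ">"
            gsub(/[^A-Za-z0-9._-]/, "_", name)
            out = dir "/" name ".fa"
        }
        out { print > out }              # header and sequence lines alike
    ' "$1"
}
```

After `split_fasta genome.fa chromosomes` (placeholder names), each file in `chromosomes/` can be passed to SatsumaSynteny2 in a separate run.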


## Making SatsumaSynteny2 converge (a temporary note)

Given a new and more exhaustive convergence model, which is still under active development, Satsuma2 may fail to converge into a single final result, and rather enter an iteration cycle, where lots of small (or not so small) changes are made to the general alignment search strategy. Instead of hiding this behaviour under a fixed cutoff after a number of iterations, we have chosen to expose it, and allow the user to examine and choose the intermediate result that best suits the biological question.
Given a new and more exhaustive convergence model, which is still under active
development, Satsuma2 may fail to converge into a single final result, and
rather enter an iteration cycle, where lots of small (or not so small) changes
are made to the general alignment search strategy. Instead of hiding this
behaviour under a fixed cutoff after a number of iterations, we have chosen to
expose it, and allow the user to examine and choose the intermediate result
that best suits the biological question.

For this reason, we have introduced the parameter -dump\_cycle\_matches, which will produce an output file on each cycle. Because these output files are not particularly large and contain the whole information of the cycle, we recommend to turn this parameter on, unless you're running on datasets where you already know that the convergence setup will work correctly. You can then examine the convergence (probably using the MatchesByFeature tool if one of your genomes is annotated) and decide which solution(s) best suits your objectives.
For this reason, we have introduced the parameter -dump\_cycle\_matches, which
will produce an output file on each cycle. Because these output files are not
particularly large and contain all the information for the cycle, we recommend
turning this parameter on, unless you're running on datasets where you already
know that the convergence setup will work correctly. You can then examine the
convergence (probably using the MatchesByFeature tool if one of your genomes is
annotated) and decide which solution(s) best suit your objectives.

We are working on generalising the convergence model so it behaves well under most circumstances, but still this will always be a recommendation when starting to run SatsumaSynteny2 in new scenarios.
We are working on generalising the convergence model so it behaves well under
most circumstances, but we will still recommend this approach whenever you
start running SatsumaSynteny2 in new scenarios.


## Output files

**<outdir>/satsuma_summary.chained.out: final coordinates**

```
Contents:
target sequence name
first target base
last target base
query sequence name
first query base
last query base
identity
orientation
Contents:
target sequence name
first target base
last target base
query sequence name
first query base
last query base
identity
orientation

EXAMPLE:chrX 5947 6164 chrX 9153 9360 0.626728 +
chrX 6270 6452 chrX 9472 9654 0.576923 +```
EXAMPLE: chrX 5947 6164 chrX 9153 9360 0.626728 +
chrX 6270 6452 chrX 9472 9654 0.576923 +
```

Note: 'space' in fasta names is permissible for alignment, but all spaces will
be replaced with "\_" in the output files.
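
Because the summary file is a plain whitespace-separated table, it can be post-processed with standard tools. This sketch keeps only the matches at or above a chosen identity, assuming the column order listed under "Contents" (identity in column 7); `filter_by_identity` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines of a satsuma summary file ($2) whose
# identity (column 7, per the column list in this README) is at least $1.
filter_by_identity() {
    awk -v min="$1" 'NF >= 8 && ($7 + 0) >= (min + 0)' "$2"
}
```

For instance, `filter_by_identity 0.6 satsuma_summary.chained.out` prints only the rows with an identity of 0.6 or higher. Since spaces in sequence names are replaced with "\_" in the output, splitting on whitespace keeps the columns aligned.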

Note: ‘space’ in fasta names is permissible for alignment, but all spaces will be replaced with “_” in the output files.
**<outdir>/MergeXCorrMatches.chained.out: final readable alignments**
**<outdir>/MergeXCorrMatches.chained.out: final readable alignments**

```
EXAMPLE:
Query chr24 [29727636-29727834] vs target scaffold_24 [1206-1404] + length 198 check 198
Query chr24 [29727636-29727834] vs target scaffold\_24 [1206-1404] + length 198 check 198
Identity (w/ indel count): 52.5253 %
-------------------------------------------------------------------------------

@@ -187,29 +288,46 @@ Run each tool with no arguments to see available options.

### Alignment tool

* ColaAlignSatsuma: realigns global alignments in satsuma format (summary coordinates file) using Cola, an efficient implementation of a collection of sequence alignment algorithms.
* ColaAlignSatsuma: realigns global alignments in satsuma format (summary
coordinates file) using Cola, an efficient implementation of a collection of
sequence alignment algorithms.


### Visualisation tools

* BlockDisplaySatsuma: takes a satsuma summary file and writes displayable blocks in MizBee format, see <http://www.cs.utah.edu/~miriah/mizbee/Overview.html> for how to display this using the MizBee Synteny Browser.
* ChromosomePaint: generates a comparative chromosome view in postscript format from the MizBee file generated by BlockDisplaySatsuma.
* MicroSyntenyPlot: generates a postscript visualisation of synteny from the HomologyByXCorr binary output file (xcorr\_aligns\_final.out).
* BlockDisplaySatsuma: takes a satsuma summary file and writes displayable
blocks in MizBee format, see
<http://www.cs.utah.edu/~miriah/mizbee/Overview.html> for how to display this
using the MizBee Synteny Browser.

* ChromosomePaint: generates a comparative chromosome view in postscript format
from the MizBee file generated by BlockDisplaySatsuma.

* MicroSyntenyPlot: generates a postscript visualisation of synteny from the
HomologyByXCorr binary output file (xcorr\_aligns\_final.out).


### Scaffold synteny tools

* Chromosemble: runs a pipeline that scaffolds an assembly using synteny.
* OrderOrientBySynteny: orders and orients scaffolds according to a synteny map.

* OrderOrientBySynteny: orders and orients scaffolds according to a synteny
map.


### Other useful tools
* MatchesByFeature: report matches by specific features using a GFF3 file. To show matches to exon and CDS features defined in GFF file genome.gff using match files match1 and match2 use;

* MatchesByFeature: report matches by specific features using a GFF3 file. To
show matches to exon and CDS features defined in GFF file genome.gff using
match files match1 and match2 use;

```
./MatchesByFeature genome.gff exon CDS - - - - - match1 match2
```
* ReverseSatsumaOut: swaps query and target columns in the satsuma output file.
* SatsumaToFasta: generates a FASTA file using a satsuma summary file from either the query or the target genome.
* SatsumaToGFF: generates a GFF3 file from a satsuma summary file.


* ReverseSatsumaOut: swaps query and target columns in the satsuma output file.

* SatsumaToFasta: generates a FASTA file using a satsuma summary file from
either the query or the target genome.

* SatsumaToGFF: generates a GFF3 file from a satsuma summary file.