Commit
Version 1.4.0:
- Added the script syntracker_makeDB.py, which creates a blastDB directory that can later be reused in multiple SynTracker runs.
- Added the '-blastDB' command-line argument to use the directory that was created by syntracker_makeDB.py.
- Removed the '--avg_all' command-line argument; its behavior is now the default.
- Fixed a problem in the execution of the 'continue' mode.
inbalpaz authored and ipaz committed Nov 11, 2024
1 parent bb6edd3 commit bd6d56b
Showing 10 changed files with 378 additions and 191 deletions.
92 changes: 57 additions & 35 deletions README.md
@@ -66,42 +66,23 @@ In this case, the user must provide only the path to the output folder of the ru
b. Using mode = '**continue_all_genomes**': process all the reference genomes again, without repeating the stage in which a blast database is built from the target genomes (which can be very time-consuming in case of many targets).
It only makes sense to use this mode when running with more than one reference genome.

### Usage examples using the provided sample data
(With the minimal required mandatory input parameters)

**A new run:**
```
python syntracker.py -target Sample_Data/Input_example/Target_genomes/ -ref Sample_Data/Input_example/Reference_genomes/ -out SynTracker_output/
```

**Continue a previous run that has been terminated:**

1. Continue from the last reference genome that has been processed without finishing successfully:
```
python syntracker.py -out SynTracker_output/ -mode continue
```

2. Process all the reference genome again without repeating the blastDB building stage
(relevant only for datasets containing more than one reference genome):
```
python syntracker.py -out SynTracker_output/ -mode continue_all_genomes
```

### A description of all SynTracker's possible command line arguments:

```
python syntracker.py [-h] [-target target_directory_path] [-ref ref_directory_path]
[-out output_directory_path] [-mode 'new'/'continue']
python syntracker.py [-h/--help] [-target target_directory_path]
[-ref ref_directory_path] [-out output_directory_path]
[-mode 'new'/'continue'/'continue_all_genomes']
[-blastDB blastDB_directory_path]
[-cores number_of_cores] [-length region_length]
[--identity blast_identity] [--coverage blast_coverage]
[--no_seed] [--avg_all]
[--no_seed]
options:
-h, --help show this help message and exit
-target [target_directory_path]
Path of the target directory which contains metagenome assemblies or genomes
Path of the target directory which contains metagenome assemblies or genomes
-ref [ref_directory_path]
Path of the references folder containing the reference genomes
@@ -117,6 +98,13 @@ options:
'continue' mode: continue from the last reference genome that was previously processed.
'continue_all_genomes' mode: process all the reference genomes again, without repeating the stage in which a blast database is built from the target genomes.
-blastDB [blastDB_directory_path]
The path to the directory which was previously created by syntracker_makeDB.py
and contains the uniquely renamed target genomes and the blastDB.
This is an advanced optional argument to be used when the blastDB has already been
created by syntracker_makeDB.py. When using it, there is no need to provide the
'-target' argument.
-cores [number_of_cores]
The number of cores to use for the multi-processed stages of the calculation.
(Optional, by default SynTracker uses the maximal number of available cores).
@@ -133,11 +121,45 @@ options:
--no_seed Set no seed for the subsampling of n regions per pairwise (optional).
This means that the average synteny scores may change between SynTracker runs due to the subsampling.
By default, a seed=1 is set to enable reproducibility between different runs.
--avg_all
Create an additional output table with APSS (Average Pairwise Synteny Scores),
which are based on all the available regions per each pair of samples
(in addition to the output tables, based on the subsampling of n regions).
```
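
The effect of the default seed described under `--no_seed` can be illustrated with a small sketch. This is not SynTracker's own code: the helper name `average_subsampled_score`, the made-up score list, the use of Python's `random` module, and the choice to return `None` when too few regions are available are all assumptions made for illustration only.

```python
import random
from statistics import mean


def average_subsampled_score(region_scores, n_regions, seed=1):
    """Average the scores of n_regions regions sampled without replacement.

    Hypothetical helper: seed=1 mirrors the documented default,
    while seed=None mimics the behaviour described for --no_seed.
    """
    if len(region_scores) < n_regions:
        return None  # assumption of this sketch: too few regions, no average
    rng = random.Random(seed)  # a private generator, so the seed is local
    return mean(rng.sample(region_scores, n_regions))


# Made-up per-region synteny scores for one pair of samples.
scores = [0.91, 0.88, 0.97, 0.75, 0.99, 0.83, 0.94, 0.90]

first = average_subsampled_score(scores, 4)
second = average_subsampled_score(scores, 4)
print(first == second)
```

With a fixed seed the same regions are drawn on every call, so the two averages are identical. Passing `seed=None` lets the draw differ between calls, which is why averages may change between runs when no seed is set.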

### Usage examples using the provided sample data
(With the minimal required mandatory input parameters)

**A new run:**
```
python syntracker.py -target Sample_Data/Input_example/Target_genomes/ -ref Sample_Data/Input_example/Reference_genomes/ -out SynTracker_output/
```

**Continue a previous run that has been terminated:**

1. Continue from the last reference genome that has been processed without finishing successfully:
```
python syntracker.py -out SynTracker_output/ -mode continue
```

2. Process all the reference genomes again without repeating the blastDB building stage
(relevant only for datasets containing more than one reference genome):
```
python syntracker.py -out SynTracker_output/ -mode continue_all_genomes
```

**Advanced use-case: create blastDB and use it in distributed batches of reference genomes**

This usage is recommended when the input dataset contains many reference genomes that can be divided into batches
and be executed in a distributed way (as opposed to the normal SynTracker run, which runs the reference genomes one by one).
In this case, SynTracker should be executed in two separate stages:

Stage 1. Run the script `syntracker_makeDB.py` to create a directory containing the uniquely renamed target genomes
and the blast database created from them:
```
python syntracker_makeDB.py -target Sample_Data/Input_example/Target_genomes/ -out blastDB_output/
```

Stage 2. Run SynTracker using the `-blastDB` argument, providing the previously created blastDB directory:
```
python syntracker.py -blastDB blastDB_output/ -out SynTracker_output/
```
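
The distributed use-case above can be sketched as a small Python helper that splits the reference genomes into batches and prepares one SynTracker command per batch. This is only a sketch under stated assumptions: the helper names, the batch directory names and the per-batch output directories are hypothetical, and passing `-ref` together with `-blastDB` is assumed from the option description (which says only `-target` may be omitted).

```python
from pathlib import Path


def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def build_batch_commands(batch_ref_dirs, blastdb_dir, out_root):
    """Build one syntracker.py command (as an argv list) per batch directory.

    Assumption: each batch of reference genomes sits in its own directory
    and writes to its own output directory under out_root.
    """
    commands = []
    for ref_dir in batch_ref_dirs:
        out_dir = Path(out_root) / Path(ref_dir).name
        commands.append([
            "python", "syntracker.py",
            "-ref", str(ref_dir),
            "-blastDB", str(blastdb_dir),
            "-out", str(out_dir),
        ])
    return commands


# Hypothetical reference genome files, grouped two per batch.
references = ["ref_A.fasta", "ref_B.fasta", "ref_C.fasta", "ref_D.fasta", "ref_E.fasta"]
batches = split_into_batches(references, 2)

# Hypothetical directories, one per batch, each holding that batch's files.
batch_dirs = [f"ref_batches/batch_{i + 1}" for i in range(len(batches))]
commands = build_batch_commands(batch_dirs, "blastDB_output/", "SynTracker_output/")
for command in commands:
    print(" ".join(command))
```

Each printed command can then be submitted as a separate job, for example through a cluster scheduler, since all batches share the same blastDB directory created in Stage 1.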

## Output
@@ -154,9 +176,9 @@ The second type of output tables, `[genome name]_avg_synteny_scores_[subsampling
from the overall regions that appear in the raw table (detailed above).
By default, N equals 40, 60, 80, 100, 200 regions per pair of samples.

In case the user has applied the --avg_all option, an additional table,
named `[genome name]_avg_synteny_scores_all_regions.csv` is created too. In this table, the APSS are calculated
using all the available regions per each pair of samples.
In addition, a table named `[genome name]_avg_synteny_scores_all_regions.csv` is created. This table outputs the APSS
(Average Pairwise Synteny Scores) calculated using all the available regions per each pair of samples without
subsampling.

#### Summary output (all genomes together):
SynTracker also creates the same output tables mentioned above for all the reference genomes combined.
@@ -166,7 +188,7 @@ The raw (per-region synteny scores) table is called `synteny_scores_per_region.c

The tables containing the APSS in different subsampling lengths are called `avg_synteny_scores_[subsampling length]_regions.csv`.

The table containing the APSS using all regions (in case of applying the --avg_all option)
The table containing the APSS using all regions (without subsampling)
is called `avg_synteny_scores_all_regions.csv`.
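
The naming scheme of the APSS tables described above can be summarised in a short sketch. The function name `expected_apss_tables` is hypothetical; the file name patterns and the default subsampling lengths are taken from the text. The sketch only builds names and makes no claim about which tables a given run actually writes.

```python
# Default subsampling lengths (N regions per pair of samples) listed in the README.
SUBSAMPLING_LENGTHS = [40, 60, 80, 100, 200]


def expected_apss_tables(genome_name=None, subsampling_lengths=SUBSAMPLING_LENGTHS):
    """Return the APSS table names for one genome, or for the summary output.

    Hypothetical helper: per-genome tables carry a "[genome name]_" prefix,
    the combined (all genomes) tables carry no prefix.
    """
    prefix = f"{genome_name}_" if genome_name else ""
    names = [f"{prefix}avg_synteny_scores_{n}_regions.csv" for n in subsampling_lengths]
    # The non-subsampled table is now always produced (formerly --avg_all).
    names.append(f"{prefix}avg_synteny_scores_all_regions.csv")
    return names


for name in expected_apss_tables():
    print(name)
```

Calling `expected_apss_tables("my_genome")` (a made-up genome name) yields the same list with the `my_genome_` prefix, matching the per-genome naming.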

#### Sample output:
Binary file modified SynTracker_Manual.docx
Binary file not shown.
Binary file modified SynTracker_Manual.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion SynTracker_env.yml
@@ -1,4 +1,4 @@
name: SynTracker_1_3
name: SynTracker_1_4
channels:
- conda-forge
- bioconda
3 changes: 2 additions & 1 deletion config.py
@@ -33,6 +33,8 @@
blast_db_file_path = ""
blast_out_dir = "blast_output/"
blastdbcmd_out_dir = "blastdbcmd_output/"
is_syntracker_makeDB_dir = False
syntracker_makeDB_dir = ""

# Central regions related parameters
region_length = 1000
@@ -57,7 +59,6 @@
subsampled_regions_file_names = []
for i in range(len(subsampling_lengths)):
    subsampled_regions_file_names.append("avg_synteny_scores_" + str(subsampling_lengths[i]) + "_regions.csv")
avg_all = False # Whether to add non-subsampled output (average all the regions per pair of samples)
avg_all_file_name = "avg_synteny_scores_all_regions.csv"

# Run related parameters