Version 1.4.0:

- Added the script syntracker_makeDB.py to create a blastDB directory, that can be later used in multiple SynTracker runs.
- Added the '-blastDB' command-line argument to use the directory that was created by syntracker_makeDB.py.
- Removed the '--avg_all' command-line argument - this option became default.
- Fixed a problem in the execution of the 'continue' mode.
leylabmpi · Nov 11, 2024 · bd6d56b
1 parent bb6edd3
commit bd6d56b

Showing 10 changed files with 378 additions and 191 deletions.
diff --git a/README.md b/README.md
@@ -66,42 +66,23 @@ In this case, the user must provide only the path to the output folder of the ru
       b. Using mode = '**continue_all_genomes**': process all the reference genomes again, without repeating the stage in which a blast database is built from the target genomes (which can be very time-consuming in case of many targets). 
 It only makes sense to use this mode with when running more than one reference genome.
 
-### Usage examples using the provided sample data
-(With the minimal required mandatory input parameters)
-
-**A new run:**
-```
-python syntracker.py -target Sample_Data/Input_example/Target_genomes/ -ref Sample_Data/Input_example/Reference_genomes/ -out SynTracker_output/
-```
-
-**Continue a previous run that has been terminated:**
-
-1. Continue from the last reference genome that has been processed without finishing successfully:
-```
-python syntracker.py -out SynTracker_output/ -mode continue
-```
-
-2. Process all the reference genome again without repeating the blastDB building stage 
-(relevant only for datasets containing more than one reference genome):
-```
-python syntracker.py -out SynTracker_output/ -mode continue_all_genomes
-```
-
 ### A description of all SynTracker's possible command line arguments:
 
 ```
-python syntracker.py [-h] [-target target_directory_path] [-ref ref_directory_path] 
-                     [-out output_directory_path] [-mode 'new'/'continue'] 
+python syntracker.py [-h/--help] [-target target_directory_path] 
+                     [-ref ref_directory_path] [-out output_directory_path] 
+                     [-mode 'new'/'continue'/'continue_all_genomes'] 
+                     [-blastDB blastDB_directory_path]
                      [-cores number_of_cores] [-length region_length] 
                      [--identity blast_identity] [--coverage blast_coverage] 
-                     [--no_seed] [--avg_all]
+                     [--no_seed]
 
 options:
   -h, --help        show this help message and exit
   
   -target [target_directory_path]
-                    Path of the target directory which contains metagenome assemblies or genomes
-                    
+                    Path of the target directory which contains metagenome assemblies or genomes           
+
   -ref [ref_directory_path]
                     Path of the references folder containing the reference genomes
                     
@@ -117,6 +98,13 @@ options:
                     'continue' mode: continue from the last reference genome that was previously processed.
                     'continue_all_genomes' mode: process all the reference genomes again, without repeating the stage in which a blast database is built from the target genomes.
   
+  -blastDB [blastDB_directory_path]
+                    The path to the directory which was previously created by syntracker_makeDB.py
+                    and contains the uniquely renamed target genomes and the blastDB.
+                    This is an advanced optional argument to be used when the blastDB has already been 
+                    created by syntracker_makeDB.py. When using it, there is no need to provide the
+                    '-target' argument.
+                    
   -cores [number_of_cores]
                     The number of cores to use for the multi-processed stages of the calculation. 
                     (Optional, by default SynTracker uses the maximal number of available cores).
@@ -133,11 +121,45 @@ options:
   --no_seed         Set no seed for the subsampling of n regions per pairwise (optional). 
                     This means that the average synteny scores may change between SynTracker runs due to the subsampling. 
                     By default, a seed=1 is set to enable reproducibility between different runs.
- 
-  --avg_all   
-                    Create an additional output table with APSS (Average Pairwise Synteny Scores), 
-                    which are based on all the available regions per each pair of samples 
-                    (in addition to the output tables, based on the subsampling of n regions).                 
+                  
+```
+
+### Usage examples using the provided sample data
+(With the minimal required mandatory input parameters)
+
+**A new run:**
+```
+python syntracker.py -target Sample_Data/Input_example/Target_genomes/ -ref Sample_Data/Input_example/Reference_genomes/ -out SynTracker_output/
+```
+
+**Continue a previous run that has been terminated:**
+
+1. Continue from the last reference genome that has been processed without finishing successfully:
+```
+python syntracker.py -out SynTracker_output/ -mode continue
+```
+
+2. Process all the reference genomes again without repeating the blastDB building stage 
+(relevant only for datasets containing more than one reference genome):
+```
+python syntracker.py -out SynTracker_output/ -mode continue_all_genomes
+```
+
+**Advanced use-case: create blastDB and use it in distributed batches of reference genomes**
+
+This usage is recommended when the input dataset contains many reference genomes that can be divided into batches 
+and be executed in a distributed way (as opposed to the normal SynTracker run, which runs the reference genomes one by one).
+In this case, SynTracker should be executed in two separated stages:
+
+Stage 1. Run the script `syntracker_makeDB.py` to create a directory containing the uniquely renamed target genomes
+and the blast database created from them:
+```
+python syntracker_makeDB.py -target Sample_Data/Input_example/Target_genomes/ -out blastDB_output/
+```
+
+Stage 2. Run SynTracker using -blastDB argument, providing the previously created blastDB directory:
+```
+python syntracker.py -blastDB blastDB_output/ -out SynTracker_output/ 
 ```
 
 ## Output
@@ -154,9 +176,9 @@ The second type of output tables, `[genome name]_avg_synteny_scores_[subsampling
 from the overall regions that appear in the raw table (detailed above). 
 By default, N equals to 40, 60, 80, 100, 200 regions per pair of samples.
 
-In case the user has applied the --avg_all option, an additional table, 
-named `[genome name]_avg_synteny_scores_all_regions.csv` is created too. In this table, the APSS are calculated 
-using all the available regions per each pair of samples. 
+In addition, a table named `[genome name]_avg_synteny_scores_all_regions.csv` is created. This table outputs the APSS 
+(Average Pairwise Synteny Scores) calculated using all the available regions per each pair of samples without 
+subsampling.
 
 #### Summary output (all genomes together):
 Syntracker also creates the same output tables mentioned above for all the references genomes combined together. 
@@ -166,7 +188,7 @@ The raw (per-region synteny scores) table is called `synteny_scores_per_region.c
 
 The tables containing the APSS in different subsampling lengths are called `avg_synteny_scores_[subsampling length]_regions.csv`.
 
-The table containing the APSS using all regions (in case of applying the --avg_all option) 
+The table containing the APSS using all regions (without subsampling) 
 is called `avg_synteny_scores_all_regions.csv`.
 
 #### Sample output:
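
Note on the advanced use-case added to the README above: the two-stage workflow could be driven by a small helper that splits the reference genomes into batches and builds one `syntracker.py` command per batch, all reusing the same blastDB directory. The sketch below is only an illustration under stated assumptions. The helper names `plan_batches` and `batch_command`, the batch directory layout, and the file names are hypothetical. It also assumes that `-ref` is passed together with `-blastDB`, one directory per batch, which the README's Stage 2 example does not show explicitly.

```python
from pathlib import Path


def plan_batches(ref_files, batch_size):
    """Split reference genome file names into consecutive, sorted batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(ref_files)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def batch_command(blastdb_dir, batch_ref_dir, out_dir):
    """Build the argv for one SynTracker run that reuses an existing blastDB directory."""
    return [
        "python", "syntracker.py",
        "-blastDB", str(blastdb_dir),
        "-ref", str(batch_ref_dir),  # assumption: each batch has its own reference directory
        "-out", str(out_dir),
    ]


if __name__ == "__main__":
    # Hypothetical reference genome file names, for illustration only.
    refs = [f"genome_{i}.fasta" for i in range(5)]
    for n, batch in enumerate(plan_batches(refs, 2), start=1):
        cmd = batch_command(
            "blastDB_output/",
            Path("ref_batches") / f"batch_{n}",
            Path("SynTracker_output") / f"batch_{n}",
        )
        print(batch, "->", " ".join(cmd))
```

The sketch only plans the batches and prints the commands. Placing each batch's genome files into its directory and submitting the commands to a scheduler is left to the user's environment.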

diff --git a/SynTracker_Manual.docx b/SynTracker_Manual.docx
diff --git a/SynTracker_Manual.pdf b/SynTracker_Manual.pdf
diff --git a/SynTracker_env.yml b/SynTracker_env.yml
@@ -1,4 +1,4 @@
-name: SynTracker_1_3
+name: SynTracker_1_4
 channels:
   - conda-forge
   - bioconda

diff --git a/config.py b/config.py
@@ -33,6 +33,8 @@
 blast_db_file_path = ""
 blast_out_dir = "blast_output/"
 blastdbcmd_out_dir = "blastdbcmd_output/"
+is_syntracker_makeDB_dir = False
+syntracker_makeDB_dir = ""
 
 # Central regions related parameters
 region_length = 1000
@@ -57,7 +59,6 @@
 subsampled_regions_file_names = []
 for i in range(len(subsampling_lengths)):
     subsampled_regions_file_names.append("avg_synteny_scores_" + str(subsampling_lengths[i]) + "_regions.csv")
-avg_all = False  # Whether to add non-subsampled output (average all the regions per pair of samples)
 avg_all_file_name = "avg_synteny_scores_all_regions.csv"
 
 # Run related parameters