The Python multithreaded version of the famous cgmlst-dists bioinformatics tool by Torsten Seemann

genpat-it/cgmlst-dists-py

Overview

This is the Python multithreaded version of cgmlst-dists originally developed by Torsten Seemann. The original repository can be found at https://github.com/tseemann/cgmlst-dists/.

cgmlst-dists calculates pairwise Hamming distances between genome profiles typed with a core genome multilocus sequence typing (cgMLST) schema. This Python version uses multithreading to speed up the distance calculations.
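
As a minimal sketch of what such a distance means, the snippet below counts the loci at which two allele profiles differ. It assumes, for illustration only, that loci where either profile has a missing call (`-`, the documented default of `--missing_char`) are skipped. The names `hamming` and `pairwise` are hypothetical and are not taken from the tool's source.

```python
from typing import Dict, Sequence

MISSING = "-"  # assumed missing-data marker (documented default of --missing_char)


def hamming(a: Sequence[str], b: Sequence[str], missing: str = MISSING) -> int:
    """Count loci where both calls are present and differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for x, y in zip(a, b) if x != y and missing not in (x, y))


def pairwise(profiles: Dict[str, Sequence[str]]) -> Dict[str, Dict[str, int]]:
    """Full symmetric distance matrix keyed by sample ID."""
    ids = list(profiles)
    return {i: {j: hamming(profiles[i], profiles[j]) for j in ids} for i in ids}


profiles = {
    "s1": ["1", "2", "3", "4"],
    "s2": ["1", "2", "5", "-"],
    "s3": ["7", "2", "5", "4"],
}
matrix = pairwise(profiles)
print(matrix["s1"]["s2"])  # 1: the third locus differs, the fourth is missing in s2
```

The sketch computes every cell of the full matrix; because the distance is symmetric, computing one triangle and mirroring it would halve the work.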

For more information on cgMLST and the original cgmlst-dists tool, please refer to the cgmlst-dists GitHub repository.

Usage

$ python cgmlst-dists.py --help
usage: cgmlst-dists.py [-h] [--input INPUT] [--output OUTPUT] [--skip_input_replacements] 
                       [--input_sep INPUT_SEP] [--output_sep OUTPUT_SEP] [--index_name INDEX_NAME]
                       [--matrix-format {full,lower-tri,upper-tri}] [--num_threads NUM_THREADS] 
                       [--chunk_size CHUNK_SIZE] [--missing_char MISSING_CHAR]
                       [--locus-completeness LOCUS_COMPLETENESS]
                       [--sample-completeness SAMPLE_COMPLETENESS] [--version]

Calculate pairwise Hamming distances. Version: 0.0.3

options:
  -h, --help            show this help message and exit
  --input INPUT         Path to the input TSV file
  --output OUTPUT       Path to save the output TSV file
  --skip_input_replacements
                        Skip input replacements when there are no strings in the input (to save unnecessary computations)
  --input_sep INPUT_SEP
                        Input file separator (default: '\t')
  --output_sep OUTPUT_SEP
                        Output file separator (default: '\t')
  --index_name INDEX_NAME
                        Name for the index column (default: 'cgmlst-dists')
  --matrix-format {full,lower-tri,upper-tri}
                        Format for the output matrix (default: full)
  --num_threads NUM_THREADS
                        Number of threads for parallel execution (default: half of available CPUs + 1)
  --chunk_size CHUNK_SIZE
                        Size of chunks to save the output file (default: 1000)
  --missing_char MISSING_CHAR
                        Character used for missing data (default: '-')
  --locus-completeness LOCUS_COMPLETENESS
                        Minimum percentage of non-missing data required for a locus (0-100)
  --sample-completeness SAMPLE_COMPLETENESS
                        Minimum percentage of non-missing data required for a sample (0-100)
  --version             show program's version number and exit

Data Filtering Example

The tool now supports filtering both loci and samples based on data completeness:

$ python cgmlst-dists.py --input input.tsv --output output.tsv --locus-completeness 85 --sample-completeness 85

Loading data from input.tsv...
Initial data shape: 100 samples × 50 loci

Applying locus completeness filter (threshold: 85%)...
Loci filtering details:
--------------------------------------------------------------------------------
Locus                          Completeness %  Status     Missing/Total
--------------------------------------------------------------------------------
locus1                              98.00%    INCLUDED   2/100
locus2                              82.00%    EXCLUDED   18/100
...

Applying sample completeness filter (threshold: 85%)...
Sample filtering details:
--------------------------------------------------------------------------------
Sample ID                      Completeness %  Status     Missing/Total
--------------------------------------------------------------------------------
sample1                             95.00%    INCLUDED   2/50
sample2                             78.00%    EXCLUDED   11/50
...

Final data shape after filtering: 80 samples × 45 loci
Calculating distances...
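
The filtering order shown in the log (loci first, then samples) can be sketched in plain Python. This is an illustration under stated assumptions, not the script's code: the threshold is treated as inclusive, sample completeness is computed on the loci that survive the first step, and the names `completeness` and `filter_profiles` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def completeness(calls: Sequence[str], missing: str = "-") -> float:
    """Percentage of non-missing calls in a sequence (0.0 for an empty one)."""
    if not calls:
        return 0.0
    return 100.0 * sum(c != missing for c in calls) / len(calls)


def filter_profiles(
    profiles: Dict[str, List[str]],
    loci: List[str],
    locus_min: float,
    sample_min: float,
    missing: str = "-",
) -> Tuple[Dict[str, List[str]], List[str]]:
    # Step 1: keep the loci (columns) that are complete enough across all samples.
    keep = [
        k for k in range(len(loci))
        if completeness([p[k] for p in profiles.values()], missing) >= locus_min
    ]
    kept_loci = [loci[k] for k in keep]
    # Step 2: keep the samples (rows) that are complete enough on the remaining loci
    # (an assumption; the script may measure samples against the original loci).
    reduced = {s: [p[k] for k in keep] for s, p in profiles.items()}
    kept = {s: p for s, p in reduced.items() if completeness(p, missing) >= sample_min}
    return kept, kept_loci


loci = ["locA", "locB", "locC"]
profiles = {
    "s1": ["1", "2", "-"],
    "s2": ["1", "3", "-"],
    "s3": ["4", "-", "-"],
    "s4": ["2", "2", "7"],
}
kept, kept_loci = filter_profiles(profiles, loci, locus_min=50, sample_min=75)
print(kept_loci)     # locC is dropped (25% complete)
print(sorted(kept))  # s3 is dropped (50% complete on the remaining loci)
```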

Validation

cgmlst-dists.py vs cgmlst-dists

System Specifications

cgmlst-dists.py was evaluated on a system running AlmaLinux 8.8 (kernel 4.18.0-477.13.1.el8_8.x86_64) on an Intel(R) Xeon(R) Gold 6252N CPU at 2.30GHz, with 192 CPUs available.

Note: If num_threads is not specified, the script defaults to using half of the available CPUs plus one.
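
The default can be written out as a small sketch. The function names below are hypothetical, integer division for "half" is an assumption, and the thread pool only illustrates one way of splitting the rows of the matrix across workers; none of it is taken from the script.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence


def default_num_threads(cpu_count: Optional[int] = None) -> int:
    """Half of the available CPUs plus one (integer division is an assumption)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return cpu_count // 2 + 1


def row_distances(i: int, rows: Sequence[Sequence[str]]) -> List[int]:
    """Distances from row i to every row (plain mismatch count, no missing-data handling)."""
    return [sum(x != y for x, y in zip(rows[i], r)) for r in rows]


def distance_matrix(rows: Sequence[Sequence[str]],
                    num_threads: Optional[int] = None) -> List[List[int]]:
    workers = num_threads or default_num_threads()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so element i of the result is row i of the matrix.
        return list(pool.map(lambda i: row_distances(i, rows), range(len(rows))))


rows = [["1", "2", "3"], ["1", "2", "4"], ["9", "2", "4"]]
print(default_num_threads(192))  # 97 on the 192-CPU test system
print(distance_matrix(rows, num_threads=2))
```

In standard CPython builds, pure-Python work like this is serialized by the global interpreter lock, so the sketch shows only the row-wise work split and would not by itself reproduce the timings reported below.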

boring.tab (5x6)

Source: https://github.com/tseemann/cgmlst-dists/blob/master/test/boring.tab

$ time ./cgmlst-dists test/boring.tab > validation/boring_c.tab
This is cgmlst-dists 0.4.0
Loaded 5 samples x 6 allele calls
Calculating distances: 100.00%
Writing distance matrix to stdout...

Done.

real    0m0.009s
user    0m0.002s
sys     0m0.003s

$ time python cgmlst-dists.py --input test/boring.tab --output validation/boring_py.tab 
Loaded matrix of 5 samples and 6 allele calls.
The final matrix will have 25 distances.
Calculations completed. Saving distances...
Process completed successfully.

Total time taken: 1.52 seconds

real    0m2.864s
user    0m4.569s
sys     0m12.362s


$ md5sum validation/boring_c.tab
f523a48b2339ab7d018fe9b69c3fc326  validation/boring_c.tab
$ md5sum validation/boring_py.tab
f523a48b2339ab7d018fe9b69c3fc326  validation/boring_py.tab
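
The same byte-for-byte check can be scripted with Python's standard `hashlib` module. The helper names `md5_of` and `same_output` are hypothetical and are not part of this repository.

```python
import hashlib


def md5_of(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in blocks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_output(path_a: str, path_b: str) -> bool:
    """True when both files have identical MD5 digests."""
    return md5_of(path_a) == md5_of(path_b)
```

For example, `same_output("validation/boring_c.tab", "validation/boring_py.tab")` would return `True` for the two files compared above.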

chewie.tab (10x10)

Source: https://github.com/tseemann/cgmlst-dists/blob/master/test/chewie.tab

$ time ./cgmlst-dists test/chewie.tab > validation/chewie_c.tab
This is cgmlst-dists 0.4.0
Loaded 10 samples x 10 allele calls
Calculating distances: 100.00%
Writing distance matrix to stdout...

Done.

real    0m0.008s
user    0m0.002s
sys     0m0.005s

$ time python cgmlst-dists.py --input test/chewie.tab --output validation/chewie_py.tab 
Loaded matrix of 10 samples and 10 allele calls.
The final matrix will have 100 distances.
Calculations completed. Saving distances...
Process completed successfully.

Total time taken: 1.52 seconds

real    0m2.888s
user    0m4.341s
sys     0m12.611s

$ md5sum validation/chewie_c.tab
de4ba5b0bb0c93fb6fb1ea90467c02ab  validation/chewie_c.tab
$ md5sum validation/chewie_py.tab
de4ba5b0bb0c93fb6fb1ea90467c02ab  validation/chewie_py.tab

100.tab (100x3016)

Source: https://github.com/tseemann/cgmlst-dists/blob/master/test/100.tab

$ time ./cgmlst-dists test/100.tab > validation/100_c.tab
This is cgmlst-dists 0.4.0
Loaded 100 samples x 3016 allele calls
Calculating distances: 100.00%
Writing distance matrix to stdout...

Done.

real    0m0.086s
user    0m0.079s
sys     0m0.005s

$ time python cgmlst-dists.py --input test/100.tab --output validation/100_py.tab 
Loaded matrix of 100 samples and 3016 allele calls.
The final matrix will have 10000 distances.
Calculations completed. Saving distances...
Process completed successfully.

Total time taken: 2.24 seconds

real    0m3.784s
user    0m10.233s
sys     0m14.935s

$ md5sum validation/100_c.tab
5a62236c697ef1eb56b7065d406007af  validation/100_c.tab
$ md5sum validation/100_py.tab
5a62236c697ef1eb56b7065d406007af  validation/100_py.tab

crc32.tab (3933x1748)

This input matrix contains no strings, so the Python version is run with --skip_input_replacements to skip the input replacements.

$ time ./cgmlst-dists test/crc32.tab > validation/crc32_c.tab
This is cgmlst-dists 0.4.0
Loaded 3933 samples x 1748 allele calls
Calculating distances: 100.00%
Writing distance matrix to stdout...

Done.

real    0m57.766s
user    0m57.325s
sys     0m0.245s

$ time python cgmlst-dists.py --input test/crc32.tab --output validation/crc32_py.tab --skip_input_replacements
Loaded matrix of 3933 samples and 1748 allele calls.
The final matrix will have 15468489 distances.
Calculations completed. Saving distances...
Process completed successfully.

Total time taken: 7.00 seconds

real    0m8.435s
user    1m42.790s
sys     0m16.848s

$ md5sum validation/crc32_c.tab 
e3f13c85c9028d49a4867eadba20c11a  validation/crc32_c.tab
$ md5sum validation/crc32_py.tab 
e3f13c85c9028d49a4867eadba20c11a  validation/crc32_py.tab

5000.tab (5000x3016)

$ time ./cgmlst-dists test/5000.tab > validation/5000_c.tab
This is cgmlst-dists 0.4.0
Loaded 5000 samples x 3016 allele calls
Calculating distances: 100.00%
Writing distance matrix to stdout...

Done.

real    2m43.855s
user    2m42.837s
sys     0m0.348s

$ time python cgmlst-dists.py --input test/5000.tab --output validation/5000_py.tab 
Loaded matrix of 5000 samples and 3016 allele calls.
The final matrix will have 25000000 distances.
Calculations completed. Saving distances...
Process completed successfully.

Total time taken: 24.89 seconds

real    0m26.361s
user    4m3.116s
sys     0m15.963s

$ md5sum validation/5000_c.tab 
2e6d5d6c8856ef408e4f596a1841bdf6  validation/5000_c.tab
$ md5sum validation/5000_py.tab 
2e6d5d6c8856ef408e4f596a1841bdf6  validation/5000_py.tab
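
To compare the two tools at a glance, the wall-clock (`real`) times reported above can be turned into speedup ratios. The snippet below is only a convenience sketch: the timings are copied from the runs above and the helper `to_seconds` is hypothetical.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time` stamp such as '2m43.855s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# (C real time, Python real time), copied from the validation runs above
runs = {
    "boring.tab": ("0m0.009s", "0m2.864s"),
    "chewie.tab": ("0m0.008s", "0m2.888s"),
    "100.tab": ("0m0.086s", "0m3.784s"),
    "crc32.tab": ("0m57.766s", "0m8.435s"),
    "5000.tab": ("2m43.855s", "0m26.361s"),
}

# A ratio above 1 means the Python version finished sooner than the C version.
speedup = {name: to_seconds(c) / to_seconds(py) for name, (c, py) in runs.items()}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

On these runs the Python version is slower on the three small inputs, where its startup cost dominates, and roughly six to seven times faster in wall-clock time on crc32.tab and 5000.tab.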

Docker

build

docker build -t cgmlst-dists-py .

help

docker run --rm cgmlst-dists-py

launch

docker run --rm -v "$(pwd):/app/data" cgmlst-dists-py --input data/test/100.tab --output data/100_py.tab

Comments

  • When the input matrix is large, the Python version is the more convenient choice (for example, 26 s versus 2 min 44 s of wall-clock time on 5000.tab)
  • The C version does not implement multithreading
  • The C version suffers from memory problems when the input is large
  • The Python version supports data quality filtering through completeness thresholds for both loci and samples
  • Detailed reporting helps identify problematic loci and samples before distance calculation
