Performance tests #89

Merged
merged 3 commits into from
Oct 14, 2024

@Will-Tyler

Overview

This pull request adds performance tests and closes #55.

Usage

To create the simulated testing data, run make -C performance/data from the project root.

Creating the testing data requires several tools: stdpopsim, tskit, bgzip, bcftools, and vcf2zarr.
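
Before running the Makefile, it can help to confirm that these tools are on the PATH. The snippet below is a minimal sketch and not part of this pull request; it assumes each tool is invoked as an executable with the same name as in the list above, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Tool names come from the list above. Whether each one is installed as an
# executable of exactly this name is an assumption.
REQUIRED_TOOLS = ["stdpopsim", "tskit", "bgzip", "bcftools", "vcf2zarr"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```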

To compare performance for all commands:

cd performance
python -m compare
bcftools view data/sim_10k.vcf.gz
10.2GiB 0:01:05 [ 159MiB/s] [                                                                                           <=>                                           ]

real    1m5.386s
user    1m4.818s
sys     0m2.131s

vcztools view data/sim_10k.vcz
10.2GiB 0:00:29 [ 358MiB/s] [                                                                                <=>                                                      ]

real    0m29.219s
user    0m34.108s
sys     0m3.471s

bcftools view -s tsk_7068,tsk_8769,tsk_8820 data/sim_10k.vcf.gz
13.9MiB 0:00:50 [ 285KiB/s] [                                                                                                                                   <=>   ]

real    0m50.004s
user    0m49.589s
sys     0m0.330s

vcztools view -s tsk_7068,tsk_8769,tsk_8820 data/sim_10k.vcz
13.9MiB 0:00:02 [6.72MiB/s] [        <=>                                                                                                                              ]

real    0m2.078s
user    0m3.413s
sys     0m0.681s

bcftools query -f '%CHROM %POS %REF %ALT{0}\n' data/sim_10k.vcf.gz
3.86MiB 0:00:04 [ 842KiB/s] [             <=>                                                                                                                         ]

real    0m4.694s
user    0m4.574s
sys     0m0.095s

vcztools query -f '%CHROM %POS %REF %ALT{0}\n' data/sim_10k.vcz
3.86MiB 0:00:04 [ 865KiB/s] [             <=>                                                                                                                         ]

real    0m4.569s
user    0m4.773s
sys     0m0.359s

bcftools query -f '%CHROM:%POS\n' -i 'POS=49887394 | POS=50816415' data/sim_10k.vcf.gz
22.0  B 0:00:04 [4.78  B/s] [  <=>                                                                                                                                    ]

real    0m4.605s
user    0m4.518s
sys     0m0.085s

vcztools query -f '%CHROM:%POS\n' -i 'POS=49887394 | POS=50816415' data/sim_10k.vcz
22.0  B 0:00:02 [9.66  B/s] [  <=>                                                                                                                                    ]

real    0m2.279s
user    0m2.754s
sys     0m0.281s

bcftools view -s '' --force-samples data/sim_10k.vcf.gz
Warn: subset called for sample that does not exist in header: ""... skipping
Warn: subsetting has removed all samples
11.5MiB 0:00:50 [ 235KiB/s] [                                                                                                                                   <=>   ]

real    0m50.049s
user    0m49.698s
sys     0m0.338s

vcztools view -s '' --force-samples data/sim_10k.vcz
12.0MiB 0:00:12 [ 984KiB/s] [                                <=>                                                                                                      ]

real    0m12.490s
user    0m15.680s
sys     0m1.768s
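
For a quick summary, the wall-clock ("real") times above can be turned into speedup ratios. This is a small sketch rather than output of the compare script; the labels are shorthand chosen here, and each figure comes from a single run, so the ratios are indicative only.

```python
# Wall-clock ("real") seconds copied from the runs above: (bcftools, vcztools).
REAL_SECONDS = {
    "view": (65.386, 29.219),
    "view -s <3 samples>": (50.004, 2.078),
    "query -f (4 fields)": (4.694, 4.569),
    "query -f -i (POS filter)": (4.605, 2.279),
    "view -s '' --force-samples": (50.049, 12.490),
}


def speedup(bcftools_seconds, vcztools_seconds):
    """Return how many times faster vcztools was than bcftools."""
    return bcftools_seconds / vcztools_seconds


for label, (bcf, vcz) in REAL_SECONDS.items():
    print(f"{label:30s} {speedup(bcf, vcz):6.2f}x")
```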

To select a command by index:

cd performance
python -m compare 2
bcftools query -f '%CHROM %POS %REF %ALT{0}\n' data/sim_10k.vcf.gz
3.86MiB 0:00:04 [ 861KiB/s] [             <=>                                                                                                                         ]

real    0m4.591s
user    0m4.451s
sys     0m0.093s

vcztools query -f '%CHROM %POS %REF %ALT{0}\n' data/sim_10k.vcz
3.86MiB 0:00:04 [ 842KiB/s] [             <=>                                                                                                                         ]

real    0m4.697s
user    0m5.183s
sys     0m0.431s

To manually compare a command:

cd performance
python -m compare query --list-samples
bcftools query --list-samples data/sim_10k.vcf.gz
86.8KiB 0:00:00 [7.62MiB/s] [  <=>                                                                                                                                    ]

real    0m0.014s
user    0m0.005s
sys     0m0.004s

vcztools query --list-samples data/sim_10k.vcz
86.8KiB 0:00:00 [ 109KiB/s] [  <=>                                                                                                                                    ]

real    0m0.797s
user    0m1.324s
sys     0m0.521s
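
The three invocation modes shown above (no arguments, a numeric index, or a manual sub-command) could be dispatched roughly as follows. This is a hypothetical sketch inferred from the output in this description, not the code in this pull request; `COMMANDS` and `select_commands` are illustrative names, and the zero-based indexing is inferred from `python -m compare 2` selecting the third command.

```python
import shlex

# Sub-commands copied from the output shown above. The real script's list
# and data structures may differ.
COMMANDS = [
    "view",
    "view -s tsk_7068,tsk_8769,tsk_8820",
    r"query -f '%CHROM %POS %REF %ALT{0}\n'",
    r"query -f '%CHROM:%POS\n' -i 'POS=49887394 | POS=50816415'",
    "view -s '' --force-samples",
]


def select_commands(argv):
    """Return (bcftools_cmd, vcztools_cmd) pairs for the given CLI arguments."""
    if not argv:
        # No arguments: compare every predefined command.
        chosen = COMMANDS
    elif len(argv) == 1 and argv[0].isdigit():
        # A single number: pick one predefined command by zero-based index.
        chosen = [COMMANDS[int(argv[0])]]
    else:
        # Anything else: treat the arguments as a manually supplied command.
        chosen = [shlex.join(argv)]
    return [
        (f"bcftools {c} data/sim_10k.vcf.gz", f"vcztools {c} data/sim_10k.vcz")
        for c in chosen
    ]
```

For example, `select_commands(["query", "--list-samples"])` yields one pair whose commands match the two shown in the manual comparison above.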

Testing

I tested the Makefile and the compare script manually.

@jeromekelleher

Fantastic, great job @Will-Tyler! I love the simplicity 👍

And excellent performance all round too, this is great!

It would be good to have a requirements.txt in the performance directory for the Python requirements. I think we can assume that bgzip, bcftools and pv are installed. I guess it would be good to document this in the comments of the script, just to be clear.

The simulated data is very easy though, and I guess we do want to provide some real data too. We should add some 1000G data also (but can do this as a follow-up).

It would also be nice to collect the top level stats in a data frame, and output results as CSV.

@Will-Tyler

> It would also be nice to collect the top level stats in a data frame, and output results as CSV.

I agree. Can this be a follow-up as well? I think it requires redirecting the error stream for both time and pv, which interferes with writing the output to the Python program's output stream.
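
As a starting point for that follow-up, the sketch below parses the real/user/sys lines in the format shown in the pull request description and writes them as CSV. It assumes the timing text has already been captured into a string, so it does not address the stream redirection issue itself; `parse_time_output` and `to_csv` are hypothetical helpers, and the standard-library `csv` module stands in for a data frame.

```python
import csv
import io
import re

# Matches lines such as "real    1m5.386s" as printed by bash's `time` keyword.
TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$", re.MULTILINE)


def parse_time_output(text):
    """Convert `time` output into a dict of seconds keyed by real/user/sys."""
    return {
        name: int(minutes) * 60 + float(seconds)
        for name, minutes, seconds in TIME_LINE.findall(text)
    }


def to_csv(rows):
    """Render a list of result dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["command", "real", "user", "sys"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Timing lines copied from the first bcftools run in the description.
sample = """\
real    1m5.386s
user    1m4.818s
sys     0m2.131s
"""
row = {"command": "bcftools view data/sim_10k.vcf.gz", **parse_time_output(sample)}
print(to_csv([row]))
```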

@tomwhite

+1

I successfully ran it on my local machine (Mac M1).

@jeromekelleher jeromekelleher merged commit e3b64ed into sgkit-dev:main Oct 14, 2024
11 checks passed