Add real performance testing data #91

Merged
merged 5 commits into from
Oct 17, 2024

@Will-Tyler Will-Tyler commented Oct 17, 2024

Overview

We want to understand how vcztools' performance compares to bcftools'. This pull request adds Makefile rules that download real genome data for testing vcztools' performance. The Makefile rules are mostly copied from the VCF Zarr publication repo. I changed the amount of data that the Makefile downloads so that the WGS file is ~250MB.

I also add some new performance testing commands in this pull request.

Results

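The performance commands I added are slower on vcztools than on bcftools: in wall-clock time, roughly 4x slower for the two view commands and roughly 6x slower for the query command. This reveals some opportunities for improvement.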

python -m compare 5 && python -m compare 6 && python -m compare 7
bcftools view -i 'FMT/DP>10 & FMT/GQ>10' data/chr22.vcf.gz
1.66GiB 0:00:11 [ 144MiB/s] [                            <=>                                                                                           ]

real    0m11.798s
user    0m11.278s
sys     0m0.508s

vcztools view -i 'FMT/DP>10 & FMT/GQ>10' data/chr22.vcz
1.66GiB 0:00:45 [37.4MiB/s] [                                                         <=>                                                              ]

real    0m45.604s
user    0m33.547s
sys     0m10.717s

bcftools view -i 'QUAL>10 || FMT/GQ>10' data/chr22.vcf.gz
1.67GiB 0:00:11 [ 154MiB/s] [                            <=>                                                                                           ]

real    0m11.103s
user    0m10.735s
sys     0m0.480s

vcztools view -i 'QUAL>10 || FMT/GQ>10' data/chr22.vcz
1.67GiB 0:00:42 [40.0MiB/s] [                                                           <=>                                                            ]

real    0m42.707s
user    0m32.329s
sys     0m10.241s

bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' data/chr22.vcf.gz
 349MiB 0:00:08 [40.8MiB/s] [                     <=>                                                                                                  ]

real    0m8.581s
user    0m8.479s
sys     0m0.212s

vcztools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' data/chr22.vcz
 349MiB 0:00:51 [6.85MiB/s] [                                                                                                                  <=>     ]

real    0m51.060s
user    0m51.190s
sys     0m0.757s
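
To make the gap easier to read, here is a small sketch that computes the slowdown factor for each command from the wall-clock ("real") times reported above. The numbers are copied from the output; the `slowdown` helper and the dictionary layout are only illustrative and are not part of the repository.

```python
# Wall-clock ("real") seconds copied from the timings above:
# (bcftools, vcztools) for each command.
timings = {
    "view -i 'FMT/DP>10 & FMT/GQ>10'": (11.798, 45.604),
    "view -i 'QUAL>10 || FMT/GQ>10'": (11.103, 42.707),
    "query -f 'GQ:[ %GQ] \\t GT:[ %GT]\\n'": (8.581, 51.060),
}


def slowdown(bcftools_seconds, vcztools_seconds):
    """Return how many times longer vcztools took than bcftools."""
    return vcztools_seconds / bcftools_seconds


for command, (bcftools_s, vcztools_s) in timings.items():
    print(f"{command}: {slowdown(bcftools_s, vcztools_s):.2f}x slower")
```

Only the "real" times are compared here. The "user" and "sys" columns may point to different bottlenecks (the view commands spend about 10 seconds in sys on vcztools, while query spends under 1 second), but that is a reading of the numbers, not a profiled result.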

Testing

I tested these changes manually.

@jeromekelleher

Excellent! Can you profile the slower commands please so we can figure out what needs to be done?
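
One possible way to approach the profiling, sketched with the standard-library `cProfile` and `pstats` modules. The `profile_call` helper and `stand_in_workload` are hypothetical placeholders: the actual vcztools function to profile is not shown in this thread, so a stand-in is used to keep the sketch runnable.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths appear first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def stand_in_workload(n):
    # Hypothetical placeholder for the real vcztools call being profiled.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(stand_in_workload, 100_000)
    print(report)
```

Swapping `stand_in_workload` for the code path behind the slow `view` or `query` command would show where the time goes, assuming that code path can be called directly from Python.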

It would be good to get a baseline "time to output all fields" on the real data as well, so we can get a feel for how well the C VCF text generation is doing.
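
A minimal sketch of how such a baseline could be timed. It assumes that "output all fields" corresponds to running `view` with no filter on the same files used above; that reading, the `BASELINES` table, and the `time_command` and `main` helpers are assumptions for illustration, not part of the repository.

```python
import subprocess
import time


def time_command(argv):
    """Return wall-clock seconds taken to run argv, discarding its stdout."""
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


# Assumed baseline commands: `view` with no filter, on the files used above.
BASELINES = {
    "bcftools": ["bcftools", "view", "data/chr22.vcf.gz"],
    "vcztools": ["vcztools", "view", "data/chr22.vcz"],
}


def main():
    # Requires both tools on PATH and the data files downloaded.
    for name, argv in BASELINES.items():
        print(f"{name}: {time_command(argv):.3f}s")
```

Calling `main()` prints one wall-clock time per tool. Output is discarded here rather than piped through a progress meter as in the results above, so the numbers would not be directly comparable with those timings.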

@jeromekelleher jeromekelleher merged commit da76620 into sgkit-dev:main Oct 17, 2024
11 checks passed