Improve performance of variant_stats #1119
Conversation
Codecov Report
@@            Coverage Diff            @@
##              main     #1119    +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           50        50
  Lines         5086      5102    +16
=========================================
+ Hits          5086      5102    +16
Amazing, thanks @timothymillar! I'll try it out on my benchmark tomorrow and report back.
I've updated the …
Looks like some new test failures due to upstream changes. The regenie, gwas_linear_regression, and bgen tests are failing with …
I've tried running this by installing it to my local setup @timothymillar, but I'm consistently getting a segfault when running the new code. I've tried clearing out the numba cache, but that doesn't seem to help. Any ideas what's going on?
This is a massive improvement, thanks @timothymillar! We're looking at more than 10X speedup in my benchmark.
@jeromekelleher my guess is that you've run into numba/numba#4807, which was the main motivation for #869. Do you get the same issues with other gufuncs when cached in parallel? In practice I've only run into this issue using gufuncs in parallel (via dask or multiprocessing) on our cluster, so I wondered whether NFS may increase the likelihood. I can also get segfaults on my local machine using the self-contained POC in that numba issue. I don't know why I only run into it using gufuncs, but based on that issue others have also had similar issues with … I've been meaning to follow up on it for some time, but wanted to have a well-parameterized example before adding to that thread.
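(For anyone wanting to poke at this: below is a small self-contained sketch in the spirit of the POC linked in numba/numba#4807. It's an assumption on my part that this exercises the same failure mode; it simply calls a cached gufunc from several worker processes.)

```python
from multiprocessing import Pool

import numpy as np
from numba import guvectorize


@guvectorize(["void(float64[:], float64[:])"], "(n)->(n)", nopython=True, cache=True)
def double(x, out):
    # Trivial cached gufunc; cache=True is the relevant part here.
    for i in range(x.shape[0]):
        out[i] = 2.0 * x[i]


def work(seed):
    rng = np.random.default_rng(seed)
    return double(rng.random((100, 10))).sum()


if __name__ == "__main__":
    # Multiple worker processes compiling/loading the same on-disk cache
    # entry is where the reported segfaults tend to show up.
    with Pool(8) as pool:
        print(sum(pool.map(work, range(32))))
```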
Can you update this please @timothymillar? The Windows stuff should be fixed now.
Force-pushed from 1c66f6a to dbfe66f.
Note that I had to make some minor adjustments to the "getting started" docs because …
LGTM!
This PR has conflicts; @timothymillar, please rebase and push an updated version 🙏
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error if variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
Force-pushed from dbfe66f to ee0d07c.
LGTM
This updates `variant_stats` to use a pair of gufuncs to calculate variant counts and hom/het genotype counts respectively. This seems to be the sweet spot for performance: it may be possible to use a single gufunc, but I can't seem to maintain the same performance (probably due to the complexity of the inner loop), and the two-gufunc solution is also nice for modularity. I also looked at using `da.apply_gufunc` to have multiple return values, but the performance seems to be worse than `map_blocks` with a single return value.
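To illustrate the pattern (a rough sketch only, not the implementation in this PR), one of the pair might look like the following. The dummy argument is needed because a numba gufunc cannot introduce an output dimension that does not appear in its inputs; kernels like this are then applied per chunk via `map_blocks`.

```python
import numpy as np
from numba import guvectorize


@guvectorize(
    ["void(int8[:, :], uint64[:], uint64[:])"],
    "(n,k),(c)->(c)",
    nopython=True,
)
def hom_het_counts(g, _, out):
    # g: genotype calls for one variant, shape (samples, ploidy).
    # _: dummy array whose length fixes the output size (hom-ref, het, hom-alt).
    out[:] = 0
    for i in range(g.shape[0]):
        first = g[i, 0]
        missing = first < 0
        het = False
        for j in range(1, g.shape[1]):
            allele = g[i, j]
            if allele < 0:
                missing = True
            elif allele != first:
                het = True
        if missing:
            continue  # this toy version simply skips partial/missing calls
        if het:
            out[1] += 1
        elif first == 0:
            out[0] += 1
        else:
            out[2] += 1


# Toy data: 2 variants x 3 samples, diploid; -1 marks a missing allele.
g = np.array(
    [[[0, 0], [0, 1], [1, 1]],
     [[0, 0], [-1, 1], [0, 0]]],
    dtype=np.int8,
)
print(hom_het_counts(g, np.empty(3, dtype=np.uint64)))
# [[1 1 1]
#  [2 0 0]]
```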
This also changes the `count_variant_alleles` function to give the option of counting alleles from the `call_genotype` array directly, rather than via the `call_allele_count` array. This introduces the `using` argument to specify the array to use. Feedback on this approach and the argument name would be appreciated, as this pattern could be used elsewhere. I have kept the old behavior as the default for now.
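As a usage sketch of the option described above (the exact accepted values for `using` are an assumption based on this description, not confirmed API):

```python
import sgkit as sg

# Small simulated dataset for illustration.
ds = sg.simulate_genotype_call_dataset(n_variant=1_000, n_sample=100, seed=0)

# Default behaviour: allele counts derived via the call_allele_count variable.
ac_default = sg.count_variant_alleles(ds)

# Proposed option: count alleles directly from the call_genotype array
# (assumed spelling of the `using` value; see the discussion above).
ac_direct = sg.count_variant_alleles(ds, using="call_genotype")
```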
I seem to be getting around an 18-22x speedup compared to the current master branch. This is variable and seems to increase with the number of samples, which makes sense given the changes to `count_variant_alleles` (no longer producing intermediate counts for each sample). It'd be good if someone else can verify this, as it's better than I expected.

Relative speedup was calculated using biallelic diploid data with 1% missing allele calls. The variants array was split into 8 chunks in all cases to ensure CPU saturation on an 8-core machine.
Relative speedup:
The first value in each column name is the number of variants and the second is the number of samples.
Note: I'm using `dask.config.set(pool=ThreadPool(<number>))` to control the thread count.
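For concreteness, a minimal benchmark sketch along these lines (the dataset sizes, the use of `simulate_genotype_call_dataset`, and its `missing_pct` argument are my assumptions here, not the exact script behind the numbers above):

```python
import time
from multiprocessing.pool import ThreadPool

import dask
import sgkit as sg

# Pin the number of worker threads, as in the note above.
dask.config.set(pool=ThreadPool(8))

# Simulated biallelic diploid calls; missing_pct approximates the
# "1% missing allele calls" used in the benchmark (assumed parameter).
ds = sg.simulate_genotype_call_dataset(
    n_variant=100_000, n_sample=1_000, missing_pct=0.01, seed=0
)

# Split the variants dimension into 8 chunks to saturate an 8-core machine.
ds = ds.chunk({"variants": ds.sizes["variants"] // 8})

start = time.perf_counter()
sg.variant_stats(ds).compute()
print(f"variant_stats: {time.perf_counter() - start:.2f}s")
```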