Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of variant_stats #1119

Merged
merged 5 commits into from
Oct 3, 2023

Conversation

timothymillar
Copy link
Collaborator

@timothymillar timothymillar commented Aug 30, 2023

This updates variant_stats to use a pair of gufuncs to calculate variant counts and hom/het genotype counts respectively.
This seems to be the sweet spot for performance, It may be possible to use a single gufunc but I can't seem to maintain the same performance (probably due to the complexity of the inner loop). The two gufunc soultion is also nice for modularity. I also looked at using da.apply_gufunc to have multiple return values but the performance seems to be worse than map_blocks with a single return value.

This also changes the count_variant_alleles function to give the option of counting alleles from the call_genotype array directly rather than via the call_allele_count array. This introduces the using argument to specify the array to use. Feedback on this approach/argument name would be appreciated as this pattern could be used elsewhere. I have kept the old behavior as the default for now.

I seem to be getting around an 18-22x speedup compared to the current master branch. This is variable and seems to increase with numbers of samples which makes sense given the changes to count_variant_alleles (no longer producing intermediate counts for each sample). It'd be good if someone else can verify this as it's better than I expected.

Relative speedup was calculated using biallelic diploid data with 1% missing allele calls. The variants array was split into 8 chunks in all cases to ensure CPU saturation on an 8 core machine.

Relative speedup:

Threads 100k * 1k 20k * 5k 10k * 10k
1 21.1 22 23.4
2 19.5 19.7 19.8
4 18.6 18.9 19.1
6 19.2 19.8 20.5
8 17.8 18.6 18.7

The first value in each column name is the number of variants and the second is the number samples.

Note: I'm using dask.config.set(pool=ThreadPool(<number>)) to control thread number.

@codecov-commenter
Copy link

codecov-commenter commented Aug 30, 2023

Codecov Report

Merging #1119 (ee0d07c) into main (72ba776) will not change coverage.
The diff coverage is 100.00%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff            @@
##              main     #1119   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           50        50           
  Lines         5086      5102   +16     
=========================================
+ Hits          5086      5102   +16     
Files Changed Coverage Δ
sgkit/stats/popgen.py 100.00% <ø> (ø)
sgkit/stats/aggregation.py 100.00% <100.00%> (ø)
sgkit/stats/aggregation_numba_fns.py 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jeromekelleher
Copy link
Collaborator

Amazing, thanks @timothymillar! I'll try it out on my benchmark tomorrow and report back.

@timothymillar
Copy link
Collaborator Author

I've updated the sample_stats function to use the same gufunction. This gives a bit of a speed up but not as much as seen for variant_stats.

@timothymillar
Copy link
Collaborator Author

looks like some new test failures due to upstream changes. regenie, gwas_linear_regression and bgen tests are failing with FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version. At first glance it looks like this is caused by a change in pandas 2.1.0: pandas-dev/pandas#52986

@jeromekelleher
Copy link
Collaborator

I've tried running this by installing to my local setup @timothymillar, but I'm consistently getting a segfault on running the new code. I've tried clearing out the numba cache, but that doesn't seem to help. Any ideas what's going on?

@jeromekelleher
Copy link
Collaborator

Using SGKIT_DISABLE_NUMBA_CACHE=1 got rid of the segfault

@jeromekelleher
Copy link
Collaborator

This is a massive improvement, thanks @timothymillar! We're looking at more than 10X speedup in my benchmark.

@timothymillar
Copy link
Collaborator Author

@jeromekelleher my guess is that you've run into numba/numba#4807 which was the main motivation for #869. Do you get the same issues with other gufuncs when cached in parallel?

In practice, I've only run into this issue using gufuncs in parallel (via dask or mp) on our cluster, so I wondered if the NFS may increase the likelihood. I can also get segfaults on my local machine using the self-contained POC in that numba issue. I don't know why I only run into it using gufuncs, but based on that issue others have also had similar issue with @njit(parallel=True).

I've been meaning to follow up on it for some time but wanted to have a well parameterized example before adding to that thread.

@jeromekelleher
Copy link
Collaborator

Can you update this please @timothymillar, windows stuff should be fixed now.

@timothymillar timothymillar force-pushed the 1116-variant-stats-perf branch 3 times, most recently from 1c66f6a to dbfe66f Compare September 8, 2023 02:36
@timothymillar
Copy link
Collaborator Author

Note that I had to make some minor adjustments to the "getting started" docs because variant_stats now always returns a dask array, even if the input is a numpy array. I think this is in line with what we have done elsewhere.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mergify
Copy link
Contributor

mergify bot commented Sep 8, 2023

This PR has conflicts, @timothymillar please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 8, 2023
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error is variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
@mergify mergify bot removed the conflict PR conflict label Sep 9, 2023
Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@timothymillar timothymillar added the auto-merge Auto merge label for mergify test flight label Sep 13, 2023
@tomwhite tomwhite merged commit 0490144 into sgkit-dev:main Oct 3, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compute variables in variant_stats concurrently
4 participants