Improve performance of `variant_stats` #1119

timothymillar · 2023-08-30T03:59:43Z

Fixes Compute variables in variant_stats concurrently #1116
Tests added
Update the relate sample_stats function
Update Tajimas_D to use faster count_variant_alleles
User visible changes (including notable bug fixes) are documented in changelog.rst

This updates variant_stats to use a pair of gufuncs to calculate variant counts and hom/het genotype counts respectively.
This seems to be the sweet spot for performance, It may be possible to use a single gufunc but I can't seem to maintain the same performance (probably due to the complexity of the inner loop). The two gufunc soultion is also nice for modularity. I also looked at using da.apply_gufunc to have multiple return values but the performance seems to be worse than map_blocks with a single return value.

This also changes the count_variant_alleles function to give the option of counting alleles from the call_genotype array directly rather than via the call_allele_count array. This introduces the using argument to specify the array to use. Feedback on this approach/argument name would be appreciated as this pattern could be used elsewhere. I have kept the old behavior as the default for now.

I seem to be getting around an 18-22x speedup compared to the current master branch. This is variable and seems to increase with numbers of samples which makes sense given the changes to count_variant_alleles (no longer producing intermediate counts for each sample). It'd be good if someone else can verify this as it's better than I expected.

Relative speedup was calculated using biallelic diploid data with 1% missing allele calls. The variants array was split into 8 chunks in all cases to ensure CPU saturation on an 8 core machine.

Relative speedup:

Threads	100k * 1k	20k * 5k	10k * 10k
1	21.1	22	23.4
2	19.5	19.7	19.8
4	18.6	18.9	19.1
6	19.2	19.8	20.5
8	17.8	18.6	18.7

The first value in each column name is the number of variants and the second is the number samples.

Note: I'm using dask.config.set(pool=ThreadPool(<number>)) to control thread number.

codecov-commenter · 2023-08-30T04:14:14Z

Codecov Report

Merging #1119 (ee0d07c) into main (72ba776) will not change coverage.
The diff coverage is 100.00%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff            @@
##              main     #1119   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           50        50           
  Lines         5086      5102   +16     
=========================================
+ Hits          5086      5102   +16

Files Changed	Coverage Δ
sgkit/stats/popgen.py	`100.00% <ø> (ø)`
sgkit/stats/aggregation.py	`100.00% <100.00%> (ø)`
sgkit/stats/aggregation_numba_fns.py	`100.00% <100.00%> (ø)`

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

jeromekelleher · 2023-08-30T12:59:07Z

Amazing, thanks @timothymillar! I'll try it out on my benchmark tomorrow and report back.

timothymillar · 2023-09-04T03:37:20Z

I've updated the sample_stats function to use the same gufunction. This gives a bit of a speed up but not as much as seen for variant_stats.

timothymillar · 2023-09-04T08:58:08Z

looks like some new test failures due to upstream changes. regenie, gwas_linear_regression and bgen tests are failing with FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version. At first glance it looks like this is caused by a change in pandas 2.1.0: pandas-dev/pandas#52986

jeromekelleher · 2023-09-04T13:34:28Z

I've tried running this by installing to my local setup @timothymillar, but I'm consistently getting a segfault on running the new code. I've tried clearing out the numba cache, but that doesn't seem to help. Any ideas what's going on?

jeromekelleher · 2023-09-04T13:49:49Z

Using SGKIT_DISABLE_NUMBA_CACHE=1 got rid of the segfault

jeromekelleher · 2023-09-04T14:32:47Z

This is a massive improvement, thanks @timothymillar! We're looking at more than 10X speedup in my benchmark.

timothymillar · 2023-09-04T21:39:26Z

@jeromekelleher my guess is that you've run into numba/numba#4807 which was the main motivation for #869. Do you get the same issues with other gufuncs when cached in parallel?

In practice, I've only run into this issue using gufuncs in parallel (via dask or mp) on our cluster, so I wondered if the NFS may increase the likelihood. I can also get segfaults on my local machine using the self-contained POC in that numba issue. I don't know why I only run into it using gufuncs, but based on that issue others have also had similar issue with @njit(parallel=True).

I've been meaning to follow up on it for some time but wanted to have a well parameterized example before adding to that thread.

https://github.com/pystatgen/sgkit/pull/1119

jeromekelleher · 2023-09-07T15:23:58Z

Can you update this please @timothymillar, windows stuff should be fixed now.

timothymillar · 2023-09-08T03:02:56Z

Note that I had to make some minor adjustments to the "getting started" docs because variant_stats now always returns a dask array, even if the input is a numpy array. I think this is in line with what we have done elsewhere.

jeromekelleher

LGTM!

mergify · 2023-09-08T12:13:17Z

This PR has conflicts, @timothymillar please rebase and push updated version 🙏

* Add count_variant_alleles option to calculate directly from calls * Improve performance of variant_stats using gufuncs * Raise error is variant_stats used on mixed-ploidy data * Document behavior of variant_stats with partial genotype calls

tomwhite

LGTM

tomwhite mentioned this pull request Sep 4, 2023

Main build is failing with Pandas 2.1.0 #1121

Closed

jeromekelleher referenced this pull request in jeromekelleher/sgkit-publication Sep 5, 2023

Use Tim's improved code

39baa0c

https://github.com/pystatgen/sgkit/pull/1119

This was referenced Sep 5, 2023

Tweaks to make large VCF convert sgkit-dev/sgkit-publication#36

Merged

Scaling fig shows poor performance relative to bcftools sgkit-dev/sgkit-publication#34

Closed

timothymillar force-pushed the 1116-variant-stats-perf branch 3 times, most recently from 1c66f6a to dbfe66f Compare September 8, 2023 02:36

jeromekelleher approved these changes Sep 8, 2023

View reviewed changes

jeromekelleher requested a review from tomwhite September 8, 2023 12:10

mergify bot added the conflict PR conflict label Sep 8, 2023

timothymillar added 5 commits September 9, 2023 12:32

Fixup getting_started docs to handle dask array

798da78

Use gufunc in sample_stats

9e8bd1e

Faster count_variant_alleles in Tajimas_D

69f9293

:Update changelog

ee0d07c

timothymillar force-pushed the 1116-variant-stats-perf branch from dbfe66f to ee0d07c Compare September 9, 2023 00:37

mergify bot removed the conflict PR conflict label Sep 9, 2023

tomwhite approved these changes Sep 11, 2023

View reviewed changes

timothymillar added the auto-merge Auto merge label for mergify test flight label Sep 13, 2023

tomwhite merged commit 0490144 into sgkit-dev:main Oct 3, 2023
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of `variant_stats` #1119

Improve performance of `variant_stats` #1119

timothymillar commented Aug 30, 2023 •

edited

Loading

codecov-commenter commented Aug 30, 2023 •

edited

Loading

jeromekelleher commented Aug 30, 2023

timothymillar commented Sep 4, 2023

timothymillar commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

timothymillar commented Sep 4, 2023

jeromekelleher commented Sep 7, 2023

timothymillar commented Sep 8, 2023

jeromekelleher left a comment

mergify bot commented Sep 8, 2023

tomwhite left a comment

Improve performance of variant_stats #1119

Improve performance of variant_stats #1119

Conversation

timothymillar commented Aug 30, 2023 • edited Loading

codecov-commenter commented Aug 30, 2023 • edited Loading

Codecov Report

jeromekelleher commented Aug 30, 2023

timothymillar commented Sep 4, 2023

timothymillar commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

timothymillar commented Sep 4, 2023

jeromekelleher commented Sep 7, 2023

timothymillar commented Sep 8, 2023

jeromekelleher left a comment

Choose a reason for hiding this comment

mergify bot commented Sep 8, 2023

tomwhite left a comment

Choose a reason for hiding this comment

Improve performance of `variant_stats` #1119

Improve performance of `variant_stats` #1119

timothymillar commented Aug 30, 2023 •

edited

Loading

codecov-commenter commented Aug 30, 2023 •

edited

Loading