Scaling fig shows poor performance relative to bcftools #34

Closed
jeromekelleher opened this issue Aug 19, 2023 · 4 comments

Comments

@jeromekelleher (Collaborator) commented Aug 19, 2023:

In #32 we added a preliminary figure intended to show how sgkit scales for large data on a simple calculation that requires looking at all the genotypes. This is what we got:

[figure: data-scaling]

Here's the raw data for threads==1:

    Unnamed: 0  num_samples  num_sites      prog  threads  user_time  sys_time     wall_time
0            0           10     116230  bcftools        1       0.09      0.00      0.110721
1            1           10     116230     sgkit        1       1.55      0.15      1.888352
6            6          100     204714  bcftools        1       0.44      0.01      0.485648
7            7          100     204714     sgkit        1       6.52      0.34      6.573964
17          12         1000     403989  bcftools        1       5.71      0.04      5.776134
16          13         1000     403989     sgkit        1      75.28      4.23     75.771989
18          18        10000     863998  bcftools        1      86.59      0.24     86.894940
19          19        10000     863998     sgkit        1    1599.32     55.18   1578.908445
24          24       100000    2365367  bcftools        1    2239.48      1.99   2242.193691
25          25       100000    2365367     sgkit        1   42960.00   1454.64  42453.703301

So, in single-threaded terms (i.e. the overall amount of compute required) we're looking at sgkit being about 20X slower than bcftools. This means we need at least 20 threads just to match bcftools' wall-clock time, which isn't great.

It would be nice if we could at least get within a factor of 10. Here's the current code, for reference:

import numpy as np
import pandas as pd


def get_prob_dist(ds, num_bins=10):
    # Bin edges on [0, 1]; nudge the last edge up so a probability of exactly
    # 1.0 falls into the final bin rather than beyond it.
    bins = np.linspace(0, 1.0, num_bins + 1)
    bins[-1] += 0.01
    # ALT allele frequency per variant and the expected genotype probabilities
    # under Hardy-Weinberg: P(het) = 2p(1 - p), P(hom alt) = p^2.
    af = ds.variant_allele_frequency.values[:, 1]
    pRA = 2 * af * (1 - af)
    pAA = af * af
    # Observed per-variant counts of hets and hom-alts.
    hets = ds.variant_n_het.values
    homs = ds.variant_n_hom_alt.values
    # Sum the observed counts into the expected-probability bins.
    a = np.bincount(np.digitize(pRA, bins), weights=hets, minlength=num_bins + 1)
    b = np.bincount(np.digitize(pAA, bins), weights=homs, minlength=num_bins + 1)

    count = (a + b).astype(int)
    return pd.DataFrame({"start": bins[:-1], "stop": bins[1:], "prob_dist": count[1:]})

(This assumes we have done something like ds = sg.variant_stats(ds) beforehand.)
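
For completeness, here is a minimal usage sketch of the pipeline around this function (not from the issue; the Zarr path is a placeholder), assuming the data has already been converted to sgkit's Zarr format:

import sgkit as sg

# Load the Zarr-backed dataset lazily (dask arrays under the hood).
ds = sg.load_dataset("chr20.zarr")
# Adds variant_allele_frequency, variant_n_het, variant_n_hom_alt, etc.
ds = sg.variant_stats(ds)
print(get_prob_dist(ds))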

I've had a quick look at the dask profiles, but it's not obvious to me how to make this faster. I guess part of the issue is that we're doing three separate full decodings of the genotype data by having:

af = ds.variant_allele_frequency.values
hets = ds.variant_n_het.values
homs = ds.variant_n_hom_alt.values
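
One option (a hedged sketch, not verified against this dataset) would be to select the three derived variables into a sub-dataset and compute them in a single pass, so dask at least has the chance to share the underlying genotype-decoding tasks rather than rebuilding them for each .values access:

# Evaluate the three derived variables in one dask graph computation.
stats = ds[["variant_allele_frequency", "variant_n_het", "variant_n_hom_alt"]].compute()
af = stats.variant_allele_frequency.values[:, 1]
hets = stats.variant_n_het.values
homs = stats.variant_n_hom_alt.values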

Also, this is being put forward as a canonical example of how the library should be used (the source code is included in the figure in #32), so please do suggest ways of making it more idiomatic if you think it can be improved. It's just intended to reproduce the output of bcftools +af-dist, which was chosen arbitrarily as something that needs to decode all the genotypes and was easy to reproduce. Note that the code doesn't do exactly what bcftools does; the binning there is a bit quirky (I think).

@jeromekelleher (Collaborator, Author) commented:

Here's the bcftools code, for reference: https://github.com/samtools/bcftools/blob/develop/plugins/af-dist.c

@jeromekelleher (Collaborator, Author) commented:

Here's the updated plot with @timothymillar's new code in https://github.com/pystatgen/sgkit/pull/1119

[figure: data-scaling]

Which is great! 🎉 This is exactly the performance regime we were expecting, so we can write a story about this now.

The left-hand side is quite high because of the numba compilation cost we're paying on every run at the moment (because I was hitting some numba segfaults). Hopefully we'll be able to sort this out at some point. I'll be rerunning this when we have data for 1M samples anyway.
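
(Once the segfaults are sorted out, one possible way to avoid paying the compilation cost on every run is numba's on-disk cache. A minimal illustrative sketch, not the actual sgkit kernel:)

import numba
import numpy as np

# cache=True persists the compiled machine code to disk, so later processes
# skip JIT compilation of this function.
@numba.njit(cache=True)
def count_hets(genotypes):
    # genotypes: (variants, 2) array of allele indices for a single sample.
    n = 0
    for i in range(genotypes.shape[0]):
        if genotypes[i, 0] != genotypes[i, 1]:
            n += 1
    return n

count_hets(np.array([[0, 1], [1, 1], [0, 0]], dtype=np.int8))  # first call compiles (and caches)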

Here's the raw data for 1 thread, for reference:

    Unnamed: 0  num_samples  num_sites      prog  threads  user_time  sys_time    wall_time
0            0           10     116230  bcftools        1       0.15      0.00     0.170811
1            1           10     116230     sgkit        1       6.21      0.25     6.432374
6            6          100     204714  bcftools        1       0.44      0.01     0.483492
7            7          100     204714     sgkit        1       6.84      0.38     7.092373
17          12         1000     403989  bcftools        1       4.71      0.01     4.757795
16          13         1000     403989     sgkit        1      12.40      0.85    12.790526
18          18        10000     863998  bcftools        1      85.80      0.46    86.351287
19          19        10000     863998     sgkit        1     117.31     12.18   148.714906
24          24       100000    2365367  bcftools        1    2227.38      3.87  2232.036372
25          25       100000    2365367     sgkit        1    2604.82    357.28  3303.462654

It's interesting to note that the sys time is about 100X higher for sgkit than for bcftools at 100K samples. I think this is a consequence of the sgkit data being scattered over 23700 files, so it isn't being accessed in a nice sequential pattern like the BCF file is. But it doesn't really matter.

@benjeffery (Contributor) commented:

Are the files on an SSD or magnetic drive? Might make a difference for sgkit if the reads aren't sequential?

@jeromekelleher (Collaborator, Author) commented:

Magnetic, and yes, it probably would help to store on SSD. The files are getting big though with 1M samples, and it's a faff trying to squeeze everything onto SSD.
