Scaling fig shows poor performance relative to bcftools #34

Closed
jeromekelleher opened this issue Aug 19, 2023 · 4 comments

Comments

@jeromekelleher (Collaborator) commented Aug 19, 2023:

In #32 we added a preliminary figure intended to show how sgkit scales for large data on a simple calculation that requires looking at all the genotypes. This is what we got:

[figure: data-scaling]

Here's the raw data for threads==1:

    Unnamed: 0  num_samples  num_sites      prog  threads  user_time  sys_time     wall_time
0            0           10     116230  bcftools        1       0.09      0.00      0.110721
1            1           10     116230     sgkit        1       1.55      0.15      1.888352
6            6          100     204714  bcftools        1       0.44      0.01      0.485648
7            7          100     204714     sgkit        1       6.52      0.34      6.573964
17          12         1000     403989  bcftools        1       5.71      0.04      5.776134
16          13         1000     403989     sgkit        1      75.28      4.23     75.771989
18          18        10000     863998  bcftools        1      86.59      0.24     86.894940
19          19        10000     863998     sgkit        1    1599.32     55.18   1578.908445
24          24       100000    2365367  bcftools        1    2239.48      1.99   2242.193691
25          25       100000    2365367     sgkit        1   42960.00   1454.64  42453.703301

So, in single-threaded terms (i.e. the overall amount of compute required) we're looking at sgkit being about 20X slower than bcftools. This means we need at least 20 threads just to match bcftools' wall-clock time, which isn't great.

It would be nice if we could at least get within a factor of 10. Here's the current code, for reference:

import numpy as np
import pandas as pd


def get_prob_dist(ds, num_bins=10):
    # Bin edges on [0, 1]; nudge the last edge up so a probability of exactly
    # 1.0 falls into the final bin rather than beyond it.
    bins = np.linspace(0, 1.0, num_bins + 1)
    bins[-1] += 0.01
    # ALT allele frequency per variant and the expected genotype probabilities
    # under Hardy-Weinberg: P(het) = 2p(1 - p), P(hom alt) = p^2.
    af = ds.variant_allele_frequency.values[:, 1]
    pRA = 2 * af * (1 - af)
    pAA = af * af
    # Observed per-variant counts of hets and hom-alts.
    hets = ds.variant_n_het.values
    homs = ds.variant_n_hom_alt.values
    # Sum the observed counts into the expected-probability bins.
    a = np.bincount(np.digitize(pRA, bins), weights=hets, minlength=num_bins + 1)
    b = np.bincount(np.digitize(pAA, bins), weights=homs, minlength=num_bins + 1)

    count = (a + b).astype(int)
    return pd.DataFrame({"start": bins[:-1], "stop": bins[1:], "prob_dist": count[1:]})

(This assumes we have done something like ds = sg.variant_stats(ds) beforehand.)
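
For completeness, here is a minimal usage sketch of the pipeline around this function (not from the issue; the Zarr path is a placeholder), assuming the data has already been converted to sgkit's Zarr format:

import sgkit as sg

# Load the Zarr-backed dataset lazily (dask arrays under the hood).
ds = sg.load_dataset("chr20.zarr")
# Adds variant_allele_frequency, variant_n_het, variant_n_hom_alt, etc.
ds = sg.variant_stats(ds)
print(get_prob_dist(ds))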

I've had a quick look at the dask profiles, but it's not obvious to me how to make this faster. I guess part of the issue is that we're doing three separate full decodings of the genotype data by having:

af = ds.variant_allele_frequency.values
hets = ds.variant_n_het.values
homs = ds.variant_n_hom_alt.values
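
One option (a hedged sketch, not verified against this dataset) would be to select the three derived variables into a sub-dataset and compute them in a single pass, so dask at least has the chance to share the underlying genotype-decoding tasks rather than rebuilding them for each .values access:

# Evaluate the three derived variables in one dask graph computation.
stats = ds[["variant_allele_frequency", "variant_n_het", "variant_n_hom_alt"]].compute()
af = stats.variant_allele_frequency.values[:, 1]
hets = stats.variant_n_het.values
homs = stats.variant_n_hom_alt.values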

Also, this is being put forward as a canonical example of how the library should be used (the source code is included in the figure in #32), so please do suggest ways of making it more idiomatic if you think it can be improved. It's just intended to reproduce the output of bcftools +af-dist, which was chosen arbitrarily as something that needs to decode all the genotypes and was easy to reproduce. Note that the code doesn't do exactly what bcftools does; the binning there is a bit quirky (I think).

@jeromekelleher (Collaborator, Author) commented:

Here's the bcftools code, for reference: https://github.com/samtools/bcftools/blob/develop/plugins/af-dist.c

@jeromekelleher (Collaborator, Author) commented:

Here's the updated plot with @timothymillar's new code in https://github.com/pystatgen/sgkit/pull/1119

[figure: data-scaling]

Which is great! 🎉 This is exactly the performance regime we were expecting, so we can write a story about this now.

The left-hand side is quite high because of the numba compilation cost we're paying on every run at the moment (because I was hitting some numba segfaults). Hopefully we'll be able to sort this out at some point. I'll be rerunning this when we have data for 1M samples anyway.
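
(Once the segfaults are sorted out, one possible way to avoid paying the compilation cost on every run is numba's on-disk cache. A minimal illustrative sketch, not the actual sgkit kernel:)

import numba
import numpy as np

# cache=True persists the compiled machine code to disk, so later processes
# skip JIT compilation of this function.
@numba.njit(cache=True)
def count_hets(genotypes):
    # genotypes: (variants, 2) array of allele indices for a single sample.
    n = 0
    for i in range(genotypes.shape[0]):
        if genotypes[i, 0] != genotypes[i, 1]:
            n += 1
    return n

count_hets(np.array([[0, 1], [1, 1], [0, 0]], dtype=np.int8))  # first call compiles (and caches)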

Here's the raw data for 1 thread, for reference:

    Unnamed: 0  num_samples  num_sites      prog  threads  user_time  sys_time    wall_time
0            0           10     116230  bcftools        1       0.15      0.00     0.170811
1            1           10     116230     sgkit        1       6.21      0.25     6.432374
6            6          100     204714  bcftools        1       0.44      0.01     0.483492
7            7          100     204714     sgkit        1       6.84      0.38     7.092373
17          12         1000     403989  bcftools        1       4.71      0.01     4.757795
16          13         1000     403989     sgkit        1      12.40      0.85    12.790526
18          18        10000     863998  bcftools        1      85.80      0.46    86.351287
19          19        10000     863998     sgkit        1     117.31     12.18   148.714906
24          24       100000    2365367  bcftools        1    2227.38      3.87  2232.036372
25          25       100000    2365367     sgkit        1    2604.82    357.28  3303.462654

It's interesting to note that the sys time is about 100X higher for sgkit than for bcftools at 100K samples. I think this is a consequence of the sgkit data being scattered over 23700 files, so it isn't being accessed in a nice sequential pattern like the BCF file is. But it doesn't really matter.

@benjeffery (Contributor) commented:

Are the files on an SSD or magnetic drive? Might make a difference for sgkit if the reads aren't sequential?

@jeromekelleher (Collaborator, Author) commented:

Magnetic, and yes, it probably would help to store on SSD. The files are getting big though with 1M samples, and it's a faff trying to squeeze everything onto SSD.
