Scaling fig shows poor performance relative to bcftools #34
Here's the bcftools code, for reference: https://github.com/samtools/bcftools/blob/develop/plugins/af-dist.c
Here's the updated plot with @timothymillar's new code from https://github.com/pystatgen/sgkit/pull/1119, which is great! 🎉 This is exactly the performance regime we were expecting, so we can write a story about this now. The left-hand side is quite high because of the compile cost we're currently paying on each run (related to some numba segfaults I was hitting). Hopefully we'll be able to sort this out at some point. I'll be rerunning this when we have data for 1M samples anyway. Here's the raw data for 1 thread, for reference:
It's interesting to note that the sys time is 100X for sgkit on 100K samples. I think this is a consequence of the data being scattered over 23700 files for sgkit, so the data isn't being accessed in a nice sequential pattern like it is for the BCF file. But it doesn't really matter.
Are the files on an SSD or magnetic drive? Might make a difference for sgkit if the reads aren't sequential?
Magnetic, and yes, it probably would help to store on SSD. The files are getting big though with 1M samples, and it's a faff trying to squeeze everything onto SSD.
In #32 we added a preliminary figure intended to show how sgkit scales on large data for a simple calculation, one that requires looking at all the genotypes. This is what we got:
Here's the raw data for threads==1:
So, in single-threaded terms (i.e. the overall amount of compute required), sgkit is about 20X slower than bcftools. This means we need at least 20 threads just to match bcftools' wall-clock time, which isn't great.
It would be nice if we could get to within a factor of 10 anyway. Here's the current code, for reference:
(This assumes we have done something like
ds = sg.variant_stats(ds)
beforehand.)

I've had a quick look at the dask profiles, but it's not obvious to me how to make this faster. I guess part of the issue is that we're doing three separate full decodings of the genotype data by having:
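As a rough NumPy-only sketch of this style of computation (hypothetical array shapes and names; the real code runs lazily on dask-backed xarray variables), the separate passes over the genotypes look something like this:

```python
import numpy as np

# Hypothetical stand-in for the benchmark data: call_genotype as an
# (n_variants, n_samples, ploidy) int8 array of allele indices.
rng = np.random.default_rng(42)
gt = rng.integers(0, 2, size=(1000, 50, 2)).astype(np.int8)

# Pass 1: alternate allele frequency per variant
# (sg.variant_stats exposes this as ds.variant_allele_frequency).
af = gt.mean(axis=(1, 2))

# Pass 2: heterozygosity of each call.
het = gt[:, :, 0] != gt[:, :, 1]

# Pass 3: accumulate het counts into allele-frequency bins.
edges = np.linspace(0.0, 1.0, 11)
bin_index = np.digitize(af, edges[1:-1])  # 10 bins over [0, 1]
het_per_bin = np.zeros(10, dtype=np.int64)
np.add.at(het_per_bin, bin_index, het.sum(axis=1))

# Each pass reads the genotypes again; a fused kernel (as in #1119)
# would walk the genotype array once, updating both statistics together.
```

The fused single-pass version is what avoids the repeated decoding cost.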
Also, this is being put forward as a canonical example of how the library should be used (the source code is included in the fig in #32), so please do suggest ways of making it more idiomatic if you think it can be improved. It's just intended to reproduce the output of `bcftools +af-dist`, which was chosen arbitrarily as something that needs to decode all genotypes and is easy to reproduce. Note the code doesn't do exactly what bcftools does; the binning is a bit quirky there (I think).
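On the binning quirk: one concrete place implementations tend to disagree is what happens at the bin edges, in particular at af == 1.0. A small NumPy illustration (not the actual bcftools logic, just the general edge case), comparing C-style truncation against `np.digitize`:

```python
import numpy as np

af = np.array([0.0, 0.1, 0.25, 0.5, 1.0])
nbins = 10

# Truncation-style binning, in the spirit of (int)(af * nbins) in C.
trunc_bins = (af * nbins).astype(np.int64)

# np.digitize into half-open bins [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
edges = np.linspace(0.0, 1.0, nbins + 1)
digit_bins = np.digitize(af, edges[1:-1])

print(trunc_bins)  # af == 1.0 falls off the end into "bin 10" unless clamped
print(digit_bins)  # af == 1.0 stays in the last bin, index 9
```

Without an explicit clamp, the truncation form needs a special case for af == 1.0, which is exactly the kind of detail that makes two implementations' bin counts differ slightly.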