
Estimate cost of GWAS regression steps #32

Open
eric-czech opened this issue Dec 18, 2020 · 3 comments

Comments

eric-czech (Collaborator) commented Dec 18, 2020

This is an estimate of the VM rental time necessary to do the GWAS regressions (similar to #8).

Here are current figures:

  • There are 1760 phenotypes that need to be processed
    • The Neale Lab results contain 2891 phenotypes, and the OT sumstats contain files for those same 2891 phenotypes
    • Our run of PHESANT, on a slightly different set of samples, produced 5384 phenotypes
    • There are 1760 in the intersection of the two
  • Running phenotypes on chr21 (141,910 variants) for 11 hr 5 mins produced results for 265 phenotypes when using a cluster of 60 n1-highmem-16 instances
  • The processing rate is then 141,910 variants * 265 phenotypes / 665 minutes = 942.51 phenotype-variant / second
    • For chr21, this equates to (665*60) / 265 = ~150 seconds per phenotype
  • For 10,691,206 variants over all contigs and 1,760 phenotypes, this implies a total processing time of ((10,691,206*1,760) / 942.51) / 86400 = ~231 days

A ballpark cost to keep a cluster of this size running that long is 60 nodes * (231 days *24 hrs) * $0.946424/hr = $314,818.
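The throughput and cost arithmetic above can be reproduced with a quick sketch (the hourly rate is the on-demand n1-highmem-16 price used in the estimate):

```python
# Back-of-the-envelope reproduction of the estimate above.

VARIANTS_CHR21 = 141_910
PHENOTYPES_RUN = 265
RUN_MINUTES = 11 * 60 + 5          # 11 hr 5 min = 665 min

# Observed throughput in phenotype-variants per second
rate = VARIANTS_CHR21 * PHENOTYPES_RUN / (RUN_MINUTES * 60)
print(f"rate = {rate:.2f} phenotype-variants/sec")        # ~942.51

# Time per phenotype on chr21
print(f"per phenotype = {RUN_MINUTES * 60 / PHENOTYPES_RUN:.1f} sec")  # ~150

# Extrapolate to all contigs and all phenotypes
TOTAL_VARIANTS = 10_691_206
TOTAL_PHENOTYPES = 1_760
days = TOTAL_VARIANTS * TOTAL_PHENOTYPES / rate / 86_400
print(f"total = {days:.0f} days")                         # ~231

# Cluster rental cost: 60 nodes at the n1-highmem-16 on-demand rate
NODES = 60
HOURLY_RATE = 0.946424
cost = NODES * days * 24 * HOURLY_RATE
print(f"cost = ${cost:,.0f}")  # ~$315k (rounding to 231 days gives $314,818)
```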

Clearly we need to find room for improvement here.

eric-czech (Collaborator, Author) commented:

On instance pricing:

  • Switching to standard instead of highmem instances would bring costs down by a factor of .8 (so 80% of original)
  • Moving to preemptible instances would bring costs down by a factor of .21

If the various Dask memory issues could be solved and we could use preemptible standard instances, the total cost would be around $53k.
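Combining both discounts with the baseline figure from the original estimate (a minimal sketch, assuming the two factors multiply independently):

```python
# Apply the pricing discounts from the bullets above to the
# ~$314,818 on-demand n1-highmem-16 baseline.

BASELINE = 314_818
STANDARD_FACTOR = 0.8      # n1-standard-16 relative to n1-highmem-16
PREEMPTIBLE_FACTOR = 0.21  # preemptible relative to on-demand

discounted = BASELINE * STANDARD_FACTOR * PREEMPTIBLE_FACTOR
print(f"${discounted:,.0f}")   # ~$52,889, i.e. "around $53k"
```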

ravwojdyla commented:

@eric-czech I assume this stat:

Running phenotypes on chr21 (141,910 variants) for 11 hr 5 mins produced results for 265 phenotypes when using a cluster of 60 n1-highmem-16 instances

and this comment https://github.com/pystatgen/sgkit/issues/390#issuecomment-748205731 are from the same run, correct? If so, how is the 11 hr 5 mins here connected with the ~2 hrs it took to run the regressions for chr21?

eric-czech (Collaborator, Author) commented:

and this comment pystatgen/sgkit#390 (comment) are for the same run, correct?

No, that caption is definitely misleading -- I was either wrong when I wrote it or trying to make it clear that the individual phenotypes can be seen as single spikes. Here is a full version of that readout that also includes the run of the 265 phenotypes:

[Screenshot: full version of the readout, including the run of the 265 phenotypes]
