Investigate use of preemptible GCP instances for GWAS #453

Closed
tomwhite opened this issue Feb 3, 2021 · 3 comments

tomwhite commented Feb 3, 2021

In #390 (and processing in general), using preemptible instances on GCP would bring a cost saving of ~5x.
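
For reference, a minimal sketch of how preemptible workers might be requested with dask-cloudprovider. The `preemptible` keyword and the project/zone/machine-type values are assumptions, so check the installed version's `GCPCluster` signature:

```python
# Sketch only: assumes a dask-cloudprovider version whose GCPCluster
# accepts a `preemptible` flag; project id and zone are placeholders.
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import Client

cluster = GCPCluster(
    projectid="my-project",        # hypothetical project id
    zone="us-east1-c",             # hypothetical zone
    machine_type="n1-standard-8",
    n_workers=16,
    preemptible=True,              # request preemptible (spot) instances
)
client = Client(cluster)
```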

tomwhite commented Feb 3, 2021

I ran a few experiments to simulate preemption by stopping a worker VM midway through a job.
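
Roughly, each run looked like the sketch below: submit the job, then stop one worker VM from outside the cluster partway through. The instance name, zone, and timing are placeholders:

```python
# Rough sketch of the experiment: start a job, then stop one worker VM
# partway through. Instance name and zone are hypothetical placeholders.
import subprocess
import time

# ... submit the GWAS computation via dask.distributed here ...

time.sleep(50)  # let the job get roughly halfway through a ~100s run
subprocess.run(
    ["gcloud", "compute", "instances", "stop", "dask-worker-3",
     "--zone", "us-east1-c"],
    check=True,
)
```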

Here is a normal run on a cluster of 16 instances with no preemption. It took 102s.

[screenshot: no-preemption]

Here is a run where I stopped one worker. The job took longer (124s), but completed fine:

[screenshot: stop1worker]

Stopping two workers extends the runtime even more (162s), but the job still completes:

[screenshot: stop2workers]

When I tried combining persisting the input dataset (#449) with preemption, the results were more mixed. Stopping one worker caused disk spilling:

[screenshot: persist_stop1worker]
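
For context, "persisting the input dataset" here means pinning the Dask-backed dataset's chunks in worker memory before running the downstream computation. A minimal illustrative sketch (the dataset and array shapes are stand-ins, not the real genotype data):

```python
# Illustrative only: persist a Dask-backed xarray dataset in worker memory
# before computing, so downstream steps reuse the materialised chunks
# instead of re-reading the input (mirrors #449).
import dask.array as da
import xarray as xr
from dask.distributed import Client

client = Client()  # or the GCPCluster client from above

# Stand-in for the real genotype dataset
ds = xr.Dataset(
    {"call_genotype": (("variants", "samples"),
                       da.zeros((100_000, 1_000), chunks=(10_000, 1_000)))}
)

ds = ds.persist()  # pin chunks in worker memory
mean_per_variant = ds["call_genotype"].mean(dim="samples").compute()
```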

tomwhite commented Feb 3, 2021

Note that all of these experiments were done simply by stopping the worker abruptly. There is an open (unmerged) Dask issue about making workers handle shutdown gracefully. The idea is that a worker that is shutting down would copy its in-memory state to other workers in the cluster, so the work doesn't need to be recomputed.
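
The scheduler can already do something similar on request: `Client.retire_workers` replicates a retiring worker's in-memory results onto other workers before removing it. A sketch of calling it when a preemption notice is received (how the notice is detected and delivered is an assumption, not existing Dask behaviour):

```python
# Sketch: on receiving a GCP preemption notice for a worker, ask the
# scheduler to retire it gracefully so its results are replicated first.
# The notice-detection hook is assumed; only retire_workers is real Dask API.
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def on_preemption_notice(worker_address: str) -> None:
    # Existing distributed API: moves the worker's keys to other workers,
    # then removes the worker from the cluster.
    client.retire_workers(workers=[worker_address])

on_preemption_notice("tcp://10.0.0.5:41234")  # hypothetical worker address
```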

This might be a challenge given GCP's preemption limits, however. An instance being preempted on GCP is given 30 seconds' notice before being forcibly terminated. An n1-standard-8 instance has 30GB of memory and a maximum egress bandwidth of 16Gbps (2GB/s), so copying a full worker's memory to another machine would take half the notice period (15s), not counting serialization cost etc.

The maximum worker-to-worker bandwidth I've seen on a cluster has been <0.5GB/s, so there's quite a gap there: at that rate the transfer would take around 60s, twice the notice period. (It still might be useful to give the worker notice of an impending shutdown so it doesn't accept new tasks, though.)
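
The back-of-the-envelope numbers behind this, for concreteness:

```python
# Back-of-the-envelope transfer times for migrating a full worker's memory.
memory_gb = 30            # n1-standard-8 memory
notice_s = 30             # GCP preemption notice period

egress_gbps = 16                      # advertised max egress bandwidth
egress_gb_per_s = egress_gbps / 8     # 2 GB/s
observed_gb_per_s = 0.5               # bandwidth actually observed

print(memory_gb / egress_gb_per_s)    # 15.0 s -> half the notice period
print(memory_gb / observed_gb_per_s)  # 60.0 s -> double the notice period
```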

tomwhite commented Jan 6, 2023
