Investigate use of preemptible GCP instances for GWAS #453
I ran a few experiments to simulate preemption by stopping a worker VM midway through a job. A normal run on a cluster of 16 instances with no preemption took 102s. A run where I stopped one worker took longer (124s), but completed fine. Stopping two workers extended the runtime even more (162s), but the job still completed. When I tried combining persisting the input dataset (#449) with preemption, the results were more mixed: stopping one worker caused disk spilling.
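For reference, here is a minimal sketch of how such an experiment could be reproduced: start a Dask computation against the cluster, then abruptly stop one worker VM partway through. The scheduler address, GCP project, zone, instance name, and the stand-in workload are all placeholders, not the actual GWAS pipeline or cluster used here.

```python
# Hypothetical sketch: simulate preemption by abruptly stopping a worker VM
# midway through a Dask job. All names below are placeholders.
import subprocess

import dask.array as da
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Stand-in workload, just to keep the workers busy for a while.
x = da.random.random((100_000, 10_000), chunks=(10_000, 1_000))
future = client.compute((x * x).mean())

# Partway through, stop one worker VM with no graceful shutdown,
# which is how the experiments above simulated preemption.
subprocess.run(
    ["gcloud", "compute", "instances", "stop", "dask-worker-3",
     "--zone", "us-east1-c", "--project", "my-project"],
    check=True,
)

# The scheduler reschedules the lost tasks on the remaining workers,
# so the job still completes, just more slowly.
print(future.result())
```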
Note that all of these experiments were done just by stopping the worker abruptly. There is an unmerged Dask issue to make workers handle shutdown gracefully: the idea is that a worker being shut down would copy its memory state to other workers in the cluster, so the work doesn't need to be recomputed. This might be a challenge within the GCP preemption limits, however. An instance being preempted on GCP is given 30 seconds notice before being forcibly terminated, which leaves little time to copy a worker's memory state elsewhere. The maximum worker-worker bandwidth I've seen on a cluster has been <0.5GB/s, so there's quite a gap there. (It still might be useful to give the worker notice of an impending shutdown so it doesn't accept new tasks, though.)
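As a rough sketch of the "notice of impending shutdown" idea: a small script on each worker VM could watch the GCE metadata server's preemption flag and then ask the scheduler to retire that worker. The scheduler and worker addresses below are placeholders; whether the memory copy that `retire_workers` attempts can finish within the 30-second window is exactly the bandwidth concern above.

```python
# Hedged sketch: give a Dask worker advance notice of GCP preemption so it
# stops accepting new tasks. Polls the GCE metadata endpoint for the
# instance's "preempted" flag; addresses below are placeholders.
import requests
from dask.distributed import Client

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
    "?wait_for_change=true"
)


def wait_for_preemption_notice() -> None:
    # Blocks until GCP flips the instance's "preempted" flag (the ~30s notice).
    while True:
        resp = requests.get(METADATA_URL, headers={"Metadata-Flavor": "Google"})
        if resp.text.strip() == "TRUE":
            return


if __name__ == "__main__":
    wait_for_preemption_notice()
    client = Client("tcp://scheduler-address:8786")  # placeholder
    # Retiring the worker means it stops receiving new tasks, and the
    # scheduler tries to replicate its in-memory results to other workers
    # before it closes. Whether that copy completes in 30 seconds depends
    # on worker memory size and the <0.5GB/s worker-worker bandwidth.
    client.retire_workers(workers=["tcp://this-worker-address:port"])  # placeholder
```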
In #390 (and processing in general), using preemptible instances on GCP would bring a cost saving of ~5x.