GCP worker disk size is too low #235

Closed
tomwhite opened this issue Jan 12, 2021 · 5 comments
Comments

@tomwhite (Contributor)

The 50GB default has very slow read/write speeds (around 7MB/s according to the docs), which causes some workloads to perform very poorly.

I suggest increasing the default to 1000GB, which would give 120MB/s read/write throughput.
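
As a rough sanity check, assuming standard persistent disk throughput scales roughly linearly with provisioned size (the figures above imply about 0.12 MB/s per GB; the exact numbers come from the GCP docs and may change):

```python
# Back-of-the-envelope sketch: standard persistent disk throughput scales
# roughly linearly with provisioned size (~0.12 MB/s per GB, per the figures above).
MBPS_PER_GB = 0.12  # approximate scaling factor, not an official constant

def approx_throughput_mbps(disk_size_gb: int) -> float:
    """Estimate sustained read/write throughput for a standard persistent disk."""
    return disk_size_gb * MBPS_PER_GB

print(approx_throughput_mbps(50))    # ~6 MB/s, close to the ~7 MB/s quoted above
print(approx_throughput_mbps(1000))  # 120 MB/s
```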

@jacobtomlinson (Member) commented Jan 12, 2021

Thanks for raising this @tomwhite.

I think having a default disk size of 1TB per worker is a little excessive for most users. This option is configurable, so users can set this to whatever they prefer.
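
For example, something along these lines should work when constructing the cluster, assuming the disk-size option is the filesystem_size keyword (worth checking the GCPCluster docstring for the exact name in your release; the project ID and zone below are placeholders):

```python
from dask_cloudprovider.gcp import GCPCluster

# Sketch: override the default 50GB disk per VM. The keyword name
# (filesystem_size) should be verified against your dask-cloudprovider
# version; the value is in GB.
cluster = GCPCluster(
    projectid="my-project",         # placeholder project ID
    zone="us-east1-c",              # placeholder zone
    machine_type="n1-standard-1",
    filesystem_size=128,            # e.g. 128GB instead of the 50GB default
    n_workers=2,
)
```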

I would consider bumping the default to either 64GB or 128GB as per the documentation you linked. But beyond that I think things start getting wasteful.

Perhaps adding a section to the docs with info about this would also be valuable.

jacobtomlinson added the provider/gcp/vm and question labels Jan 12, 2021
@tomwhite (Contributor, Author)

Improved docs are always welcome, but most users will assume the defaults will work well for most things. The difficulty here is that GCP has a very unexpected connection between disk size and performance (that goes against what most users might intuitively expect), and it's very hard to diagnose since jobs just run more slowly than they might otherwise rather than failing outright. I suspect it also depends on the type of workload - and some parts of a job do worse than other parts, as we seem to be seeing in the linked issue above.

Another way of thinking about this problem is: how could we balance compute and storage costs? The default machine type is n1-standard-1 which costs $24.27 per month. Persistent disk costs $0.04 per GB per month. That works out at about 600GB to give parity between the two costs. So maybe a 500GB default?
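
Spelling out that arithmetic:

```python
# Back-of-the-envelope cost parity between compute and disk
# (prices as quoted above, USD, subject to change).
vm_cost_per_month = 24.27        # n1-standard-1, per month
disk_cost_per_gb_month = 0.04    # standard persistent disk, per GB per month

parity_gb = vm_cost_per_month / disk_cost_per_gb_month
print(round(parity_gb))  # ~607, i.e. roughly 600GB for cost parity
```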

A better improvement might be to implement #173, change the default machine type to one with multiple vCPUs, and try to amortize the disk across those cores, on the assumption that not all cores will be accessing the disk at once.

@jacobtomlinson (Member)

Thanks for the response

> most users will assume the defaults will work well for most things

I very much agree with this. However, when it comes to creating Dask clusters, I think "most things" is a pretty broad area, and it's hard to optimise for.

For instance, in the majority of Dask workflows that I've been involved with, data is rarely read from or written to disk on the workers. Usually all read/write is from some object storage like GCS and working data is held in memory.
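
A typical pattern looks something like this (the bucket path and column names are just placeholders, and reading from GCS needs gcsfs installed):

```python
import dask.dataframe as dd

# Read directly from object storage, keep the working set in worker memory,
# and only materialise the final result on the client.
df = dd.read_parquet("gs://my-bucket/input/")          # placeholder bucket path
result = df.groupby("key")["value"].mean().compute()   # placeholder column names
```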

But I totally appreciate that my experience may differ from others, so input from the community is most welcome.

> that goes against what most users might intuitively expect

This is useful insight. I am perhaps influenced by my sysadmin and cloud engineering background here.

If I choose a small VM and a small disk, it would be reasonable to expect a small connection between the two. Especially if you consider that Google must be running some large physical server and some large storage array, with some large network connection between the two. If I ask for a small slice of two of those, I'm not going to get a large slice of the third.

> That works out at about 600GB to give parity between the two costs.

I would potentially argue here that users do not typically factor disk costs into their calculations and often assume the disk cost is negligible. If I look up the pricing for an n1-standard-4 and then, when using dask-cloudprovider, my bill ends up being double that, I'm going to be a little confused.

> A better improvement might be to implement #173

Absolutely. I recently had dask/distributed#377 merged, which is necessary for this. Once a new release of distributed is out, we can begin implementing things here.

For users doing any substantial work, I would expect them to increase the instance and disk sizes accordingly. Today performance may suffer a bit, as one worker may struggle to utilise a large machine, but with that change large instances will be split up appropriately.


To pull back from this specific discussion, I would find it interesting to think about who we are targeting with our defaults here.

Generally I target new users who may be playing with these technologies for the first time. Those users probably want to spin up a small cluster and have a poke around. They likely do not want to spend much money doing so.

But as you point out, by setting the defaults reasonably low (nobody should be using n1-standard-1 in practice, for instance), we are creating a scenario where everyone using the tool in anger will have to change the defaults.

This is a tricky line to walk and I'd appreciate some input from others. @quasiben had some interesting thoughts on this.

@tomwhite (Contributor, Author) commented Feb 9, 2021

> For instance, in the majority of Dask workflows that I've been involved with, data is rarely read from or written to disk on the workers. Usually all read/write is from some object storage like GCS and working data is held in memory.

I agree. I've been investigating this further (in https://github.com/pystatgen/sgkit/issues/390), and it seems that it is indeed best to avoid spilling to disk entirely. So the slow worker disk speed is not a problem for workloads that are predominantly in memory, and the default is actually fine. Sorry for the distraction.
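
For anyone following along, one way to avoid spilling is to switch it off in the distributed worker memory settings, roughly like this (the settings need to be in effect in the worker environment before the workers start, e.g. via the dask config file):

```python
import dask

# Sketch: disable spill-to-disk so workers keep data purely in memory;
# the pause/terminate memory thresholds still apply.
dask.config.set({
    "distributed.worker.memory.target": False,  # don't begin spilling at a target memory fraction
    "distributed.worker.memory.spill": False,   # never spill to disk
})
```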

tomwhite closed this as completed Feb 9, 2021
@quasiben (Member) commented Feb 9, 2021

Thanks for following up @tomwhite!
