Build PyData prototype for GWAS analysis #20
Comments
I saw this issue pointed to from dask/dask-blog#38. Some small comments:
FYI I'm making a company around this question. Let me know if you want to chat or be beta testers for cloud deployment products.
They first load the entire dataset into RAM. Pandas doesn't store string data efficiently, so Dask often spills to disk during those benchmarks, which is why it's slow. We encouraged them to include the time to read data from disk rather than starting from memory, but the benchmark maintainers said that would be unfair. Benchmarks are hard to do honestly.
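To make the string-storage point concrete, here is a rough sketch (not taken from the benchmark itself) of why text held as individual Python `str` objects, which is what a pandas object-dtype column points to, takes far more memory than the characters alone. The sample identifiers and the 8-byte pointer size are assumptions for illustration; exact per-object sizes vary by CPython version.

```python
import sys

# Hypothetical sample: short, unique ASCII identifiers standing in for a string column.
values = [f"id{i:07d}" for i in range(100_000)]

# Bytes of actual character data when encoded as UTF-8.
raw_bytes = sum(len(v.encode("utf-8")) for v in values)

# An object-dtype column keeps one pointer per row (assumed 8 bytes on a 64-bit
# build) plus a full Python str object, with its own header, for each value.
pointer_bytes = 8 * len(values)
object_bytes = pointer_bytes + sum(sys.getsizeof(v) for v in values)

print(f"raw text:          {raw_bytes / 1e6:.1f} MB")
print(f"as Python objects: {object_bytes / 1e6:.1f} MB")
print(f"overhead factor:   {object_bytes / raw_bytes:.1f}x")
```

With short strings the per-object header dominates, so a dataset that looks like it should fit in RAM can push workers past their memory limit and into spilling.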
Hey Matt,
Will do, but deployment isn't a big concern quite yet.
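Good to know! It will definitely be helpful to see how we could get to that conclusion with task stream monitoring. Performance with .persist() (I assume that's what they're doing based on your description) isn't particularly interesting for us, so I'm not worried about the actual times so much as being a better user. Do you happen to know if there is a dask performance report for what they did somewhere?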
Not to my knowledge. The "loading from disk" behavior would be evident in
the dashboard by orange/red memory bars in the memory plot as well as lots
of orange bars showing up in the task stream (red is network transfer,
orange is disk transfer).
On Tue, Apr 28, 2020 at 5:49 AM Eric Czech wrote:
Hey Matt,
Let me know if you want to chat or be beta testers for cloud deployment
products.
Will do, but deployment isn't a big concern quite yet.
They first load the entire dataset in RAM. Pandas doesn't store string
data efficiently. As a result Dask is often spilling to disk during those
benchmarks, which is why it's slow. We encouraged them to just include the
time to read data from disk rather than starting from memory, but the
maintainers of the benchmark said that that would be unfair.
Good to know! It will definitely be helpful to see how we could get to
that conclusion with task stream monitoring. Performance with .persist()
(I assume that's what they're doing based on your description) isn't
particularly interesting for us so I'm not worried about the actual times
so much as being a better user. Do you happen to know if there is a dask
performance report for what they did somewhere?
This issue tracks several more specific issues related to working towards a usable prototype.
Some things we should tackle for this are: