Hi @sandias42, we're working on extending our spilling mechanisms with our New Execution Engine. This should enable us to explicitly spill to sources like S3 or S3 Express! Check out the PR: #2347 :)
Daft appears to be extremely fast at reading from and writing to cloud blob storage like S3. In my limited testing, it's fast enough to nearly saturate my EC2 instance's network bandwidth. My question is: why not use S3 to "spill" data when materializing, instead of disk? Or, why not have a `.cloud_collect()` option which performs all the computations but writes to a (temporary) S3 directory instead of RAM/disk?
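Concretely, here's a minimal sketch of what I have in mind, built only on the existing `write_parquet`/`read_parquet` APIs (the `cloud_collect` name, the spill bucket, and the prefix handling are just illustrative):

```python
import uuid
import daft

def cloud_collect(df: daft.DataFrame, spill_root: str = "s3://my-bucket/daft-spill") -> daft.DataFrame:
    """Hypothetical helper: materialize `df` to a temporary S3 prefix
    instead of RAM/disk, then return a lazy reader over that prefix."""
    tmp_path = f"{spill_root.rstrip('/')}/{uuid.uuid4().hex}"
    df.write_parquet(tmp_path)          # executes the full plan, streaming results to S3
    return daft.read_parquet(tmp_path)  # cheap to re-scan; the original plan's memory can be freed
```

**Rationale**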
I haven't fully explored this yet, but so far it seems that network bandwidth is often higher than disk bandwidth, sometimes by a lot. For instance, gp3 EBS volumes get a baseline of 125 MB/s and max out at 1000 MB/s of throughput if you pay more, while an r5n.2xlarge has a baseline network bandwidth of ~1000 MB/s and bursts to ~3000 MB/s at no extra cost. On top of this, once data is written to S3, any instance in the same region can read it back equally fast, limited only by the receiving instance's bandwidth. Compare this with spilling to a worker's local disk: there, communication is limited by both the receiving instance's bandwidth and the sending instance's bandwidth.
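A quick back-of-envelope comparison using the figures above (illustrative numbers, not measured benchmarks):

```python
# Time to move a 500 GB spill at the baseline throughputs quoted above
# (illustrative figures, not measured benchmarks).
spill_gb = 500
gp3_gb_per_s = 0.125  # gp3 EBS baseline: 125 MB/s
r5n_gb_per_s = 1.0    # r5n.2xlarge baseline network: ~1000 MB/s

print(f"spill to gp3 disk:  {spill_gb / gp3_gb_per_s / 60:.1f} min")   # ~66.7 min
print(f"spill over network: {spill_gb / r5n_gb_per_s / 60:.1f} min")   # ~8.3 min
```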
Materializing with `.collect()` has also been clunky for many of my workflows, which are typically interactive data cleaning and analysis of large datasets (this is not unique to Daft; Dask's `persist` has the same problem). Oftentimes I want to do some interactive computation with an expensive intermediate (say, checking that everything worked correctly on a joined dataframe), but then I have some subsequent RAM-hungry steps to perform, and I don't have enough RAM to hold both the materialized dataframe and the output of the next expensive computation. In the current `collect` paradigm, it's really annoying to "undo" a `collect` call to free up resources.

In practice, what I end up doing is a horribly clunky version of this feature, as sketched below: I write the expensive intermediate dataframe to S3, then restart my kernel / reinitialize Ray and pick up my analysis by reading from that intermediate's tmp directory on S3.
It seems plausible to me that the relative throughput advantage and comparative statelessness of `cloud_collect` over `collect` should make it the default option for any large-ish dataframe. But I don't know! Just thought I'd throw it out there. Been loving the project :D