Reduce size of Timeseries task graph #270
Conversation
thx, this was on my todo list as well.
It's still twice as slow compared to dask/dask for a 70GB time series with 100 float columns on a 15-machine cluster. Any idea how we can speed this up without removing the seed?
What takes twice as long? Is it generating the task graph or communicating it between client and scheduler?
It's mostly the sending of the graph, it's still significantly larger. Both are equally fast if I set a seed explicitly.
Oh! I didn't realize that - very interesting.
Hm, I'm not finding that setting a seed changes the graph size in any way. However, I do see that we only store a single integer for each partition in
Setting the seed increases the graph size in dask/dask, that's what makes the performance comparable. Sorry that was misleading earlier.
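The size difference being discussed comes from how much random state each task carries. A NumPy `RandomState` (Mersenne Twister) holds 624 32-bit words of internal state, so embedding the full state per task is far heavier than embedding a single seed integer. A small sketch, using only public NumPy/pickle APIs, illustrates the gap:

```python
import pickle
import numpy as np

# The full Mersenne Twister state is a tuple containing 624 uint32 words.
state = np.random.RandomState(0).get_state()
assert state[1].shape == (624,)

# Compare the serialized size of the full state vs. a single seed integer,
# which approximates what each task would carry in the graph.
full_state_bytes = len(pickle.dumps(state))
seed_only_bytes = len(pickle.dumps(0))
print(full_state_bytes, seed_only_bytes)
```

On a typical NumPy build the full state pickles to a few kilobytes while a bare integer is a handful of bytes, which is why graphs that embed full states are noticeably larger to send.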
Possible alternative to #263
While reviewing #263, I realized we were going a bit "overboard" in `Timeseries.random_state`. More specifically, we were generating/storing `624 * len(dtypes)` 32-bit integers for every task. Given that the `timeseries` utility is generally meant for demonstration, testing, and benchmarking, I don't see any reason to generate/store more than `len(dtypes)` integers for each task.

This PR reduces the number of random integers we store for each task by a factor of 624, significantly reducing the size of the graph.