
Reduce size of Timeseries task graph #270

Merged: 3 commits into dask:main on Aug 18, 2023

Conversation

@rjzamora (Member) commented Aug 17, 2023

Possible alternative to #263

While reviewing #263, I realized we were going a bit "overboard" in Timeseries.random_state. More specifically, we were generating/storing 624 * len(dtypes) 32-bit integers for every task. Given that the timeseries utility is generally meant for demonstration, testing and benchmarking, I don't see any reason to generate/store more than len(dtypes) integers for each task.

This PR reduces the number of random integers we store for each task by a factor of 624, thereby significantly reducing the size of the graph.
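
For illustration, here is a minimal sketch of the difference (the `dtypes` mapping is hypothetical, and this is not the actual dask-expr code): NumPy's MT19937 state is a vector of 624 32-bit words, so carrying a full state per dtype in every task is far heavier than carrying a single 32-bit seed per dtype and rebuilding the stream at execution time.

```python
import numpy as np

# Hypothetical dtypes for a timeseries collection.
dtypes = {"id": int, "name": str, "x": float, "y": float}

# Before: one full Mersenne Twister state per dtype, per task.
# get_state()[1] is the MT19937 key: an array of 624 uint32 words.
full_states = [np.random.RandomState().get_state()[1] for _ in dtypes]
assert all(state.size == 624 for state in full_states)

# After: one 32-bit seed per dtype, per task. Each task rebuilds its
# RandomState from the seed when it runs, so the graph carries
# len(dtypes) integers instead of 624 * len(dtypes).
seeds = np.random.randint(0, 2**31, size=len(dtypes))
streams = [np.random.RandomState(int(seed)) for seed in seeds]
```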

@phofl (Collaborator) left a comment

thx, this was on my todo list as well.

@phofl merged commit 8352e3b into dask:main on Aug 18, 2023 (4 checks passed).
@rjzamora deleted the reduce-timeseries-graph-size branch on Aug 18, 2023.
@phofl (Collaborator) commented Aug 22, 2023

It's still twice as slow compared to dask/dask for a 70 GB time series with 100 float columns on a 15-machine cluster. Any idea how we can speed this up without removing the seed?

@rjzamora (Member, Author) commented:

> It's still twice as slow compared to dask/dask for a 70 GB time series with 100 float columns on a 15-machine cluster. Any idea how we can speed this up without removing the seed?

What takes twice as long? Is it generating the task graph or communicating it between client and scheduler?

@phofl (Collaborator) commented Aug 22, 2023

It's mostly the sending of the graph; it's still significantly larger. Both are equally fast if I set a seed explicitly.
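
For context, one rough way to compare graph sizes on the client is to pickle the materialized low-level graph. The sketch below is only a proxy (the distributed scheduler uses its own serialization, so treat the number as an order-of-magnitude estimate) and assumes the `dask.datasets.timeseries` helper:

```python
import pickle

import dask.datasets

# Compare the pickled size of the materialized task graph with and
# without an explicit seed, as a rough proxy for what the client
# ships to the scheduler.
for seed in (None, 42):
    df = dask.datasets.timeseries(start="2000-01-01", end="2000-12-31", seed=seed)
    graph = dict(df.__dask_graph__())
    size_mb = len(pickle.dumps(graph)) / 1e6
    print(f"seed={seed}: {len(graph)} tasks, ~{size_mb:.1f} MB pickled")
```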

@rjzamora (Member, Author) commented:

> Both are equally fast if I set a seed explicitly

Oh! I didn't realize that - very interesting.

@rjzamora (Member, Author) commented:

Hm, I'm not finding that setting a seed changes the graph size in any way. However, I do see that we only store a single integer for each partition in dask/dask. If I recall correctly, this is because we always generate the data for every element of dtypes (even for columns that have been dropped). In dask-expr, we store a separate seed for each of the original columns. That said, we could always adopt the dask/dask approach to reduce the graph size.
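
A small sketch of that trade-off (the column names and lengths are hypothetical, not the actual implementation): one seed per partition keeps the graph minimal but forces every original column to be generated in a fixed order, while one seed per column keeps dropped columns independent at the cost of more integers in the graph.

```python
import numpy as np

columns = ["a", "b", "c", "d"]  # hypothetical original columns

# dask-expr style: one seed per original column, per partition. Dropped
# columns keep independent streams, at the cost of len(columns) integers
# in the graph for every partition.
column_seeds = {
    c: int(s) for c, s in zip(columns, np.random.randint(0, 2**31, len(columns)))
}
column_data = {
    c: np.random.RandomState(s).random_sample(10) for c, s in column_seeds.items()
}

# dask/dask style: a single seed per partition. All columns are derived
# from one stream, so every original column must be generated in a fixed
# order (even ones later dropped) to keep results reproducible.
partition_seed = int(np.random.randint(0, 2**31))
rng = np.random.RandomState(partition_seed)
partition_data = {c: rng.random_sample(10) for c in columns}
```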

@phofl (Collaborator) commented Aug 23, 2023

Setting the seed increases the graph size in dask/dask; that's what makes the performance comparable. Sorry, that was misleading earlier.
