
Reduce size of Timeseries task graph #270

Merged: 3 commits into dask:main on Aug 18, 2023

Conversation

@rjzamora (Member) commented Aug 17, 2023

Possible alternative to #263

While reviewing #263, I realized we were going a bit "overboard" in Timeseries.random_state. More specifically, we were generating/storing 624 * len(dtypes) 32-bit integers for every task. Given that the timeseries utility is generally meant for demonstration, testing and benchmarking, I don't see any reason to generate/store more than len(dtypes) integers for each task.

This PR reduces the number of random integers we store for each task by a factor of 624, thereby significantly reducing the size of the graph.
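
For illustration, here is a minimal sketch of the difference (the `dtypes` mapping is hypothetical, and this is not the actual dask-expr code): NumPy's MT19937 state is a vector of 624 32-bit words, so carrying a full state per dtype in every task is far heavier than carrying a single 32-bit seed per dtype and rebuilding the stream at execution time.

```python
import numpy as np

# Hypothetical dtypes for a timeseries collection.
dtypes = {"id": int, "name": str, "x": float, "y": float}

# Before: one full Mersenne Twister state per dtype, per task.
# get_state()[1] is the MT19937 key: an array of 624 uint32 words.
full_states = [np.random.RandomState().get_state()[1] for _ in dtypes]
assert all(state.size == 624 for state in full_states)

# After: one 32-bit seed per dtype, per task. Each task rebuilds its
# RandomState from the seed when it runs, so the graph carries
# len(dtypes) integers instead of 624 * len(dtypes).
seeds = np.random.randint(0, 2**31, size=len(dtypes))
streams = [np.random.RandomState(int(seed)) for seed in seeds]
```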

@phofl (Collaborator) left a comment

thx, this was on my todo list as well.

@phofl merged commit 8352e3b into dask:main on Aug 18, 2023 (4 checks passed).
@rjzamora deleted the reduce-timeseries-graph-size branch on Aug 18, 2023.
@phofl (Collaborator) commented Aug 22, 2023

It's still twice as slow compared to dask/dask for a 70 GB time series with 100 float columns on a 15-machine cluster. Any idea how we can speed this up without removing the seed?

@rjzamora (Member, Author) commented:

> It's still twice as slow compared to dask/dask for a 70 GB time series with 100 float columns on a 15-machine cluster. Any idea how we can speed this up without removing the seed?

What takes twice as long? Is it generating the task graph or communicating it between client and scheduler?

@phofl (Collaborator) commented Aug 22, 2023

It's mostly the sending of the graph; it's still significantly larger. Both are equally fast if I set a seed explicitly.
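
For context, one rough way to compare graph sizes on the client is to pickle the materialized low-level graph. The sketch below is only a proxy (the distributed scheduler uses its own serialization, so treat the number as an order-of-magnitude estimate) and assumes the `dask.datasets.timeseries` helper:

```python
import pickle

import dask.datasets

# Compare the pickled size of the materialized task graph with and
# without an explicit seed, as a rough proxy for what the client
# ships to the scheduler.
for seed in (None, 42):
    df = dask.datasets.timeseries(start="2000-01-01", end="2000-12-31", seed=seed)
    graph = dict(df.__dask_graph__())
    size_mb = len(pickle.dumps(graph)) / 1e6
    print(f"seed={seed}: {len(graph)} tasks, ~{size_mb:.1f} MB pickled")
```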

@rjzamora (Member, Author) commented:

> Both are equally fast if I set a seed explicitly

Oh! I didn't realize that - very interesting.

@rjzamora (Member, Author) commented:

Hm, I'm not finding that setting a seed changes the graph size in any way. However, I do see that we only store a single integer for each partition in dask/dask. If I recall correctly, this is because we always generate the data for every element of dtypes (even for columns that have been dropped). In dask-expr, we store a separate seed for each of the original columns. That said, we could always adopt the dask/dask approach to reduce the graph size.
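
A small sketch of that trade-off (the column names and lengths are hypothetical, not the actual implementation): one seed per partition keeps the graph minimal but forces every original column to be generated in a fixed order, while one seed per column keeps dropped columns independent at the cost of more integers in the graph.

```python
import numpy as np

columns = ["a", "b", "c", "d"]  # hypothetical original columns

# dask-expr style: one seed per original column, per partition. Dropped
# columns keep independent streams, at the cost of len(columns) integers
# in the graph for every partition.
column_seeds = {
    c: int(s) for c, s in zip(columns, np.random.randint(0, 2**31, len(columns)))
}
column_data = {
    c: np.random.RandomState(s).random_sample(10) for c, s in column_seeds.items()
}

# dask/dask style: a single seed per partition. All columns are derived
# from one stream, so every original column must be generated in a fixed
# order (even ones later dropped) to keep results reproducible.
partition_seed = int(np.random.randint(0, 2**31))
rng = np.random.RandomState(partition_seed)
partition_data = {c: rng.random_sample(10) for c in columns}
```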

@phofl (Collaborator) commented Aug 23, 2023

Setting the seed increases the graph size in dask/dask; that's what makes the performance comparable. Sorry, that was misleading earlier.
