Commit

fix images
scharlottej13 committed Nov 21, 2024
1 parent 0bde311 commit d5a097d
Showing 1 changed file with 13 additions and 5 deletions.
18 changes: 13 additions & 5 deletions _posts/2024-11-21-dask-detrending.md
@@ -11,7 +11,7 @@ canonical_url: https://xarray.dev/blog/dask-detrending

_This post was originally published on the [Xarray blog](https://xarray.dev/blog/dask-detrending)._

-Running large-scale GroupBy-Map patterns with Xarray that are backed by [Dask arrays](https://docs.dask.org/en/stable/array.html?utm_source=dask-blog) is
+Running large-scale GroupBy-Map patterns with Xarray that are backed by [Dask arrays](https://docs.dask.org/en/stable/array.html) is
an essential part of many typical geospatial workloads. Detrending is a very common
operation that requires this pattern.
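
To make the pattern concrete, here is a minimal sketch of a GroupBy-Map detrend in Xarray. It is an illustration only: the tiny array, the variable names, and the simple per-group mean are assumptions standing in for the rolling-average detrending the post actually describes.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in for the real dataset: two years of daily data on a tiny grid.
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
data = xr.DataArray(
    np.random.random((4, 4, time.size)),
    dims=("lat", "lon", "time"),
    coords={"time": time},
).chunk({"lat": 2, "lon": 2, "time": 365})

def detrend(group):
    # Subtract the group's mean over the years; a simple stand-in for the
    # rolling-average detrending used in the post.
    return group - group.mean("time")

# Group by day of year and apply the detrend function to each group.
detrended = data.groupby("time.dayofyear").map(detrend)
```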

@@ -45,7 +45,9 @@ We are grouping by the day of the year and then are calculating the rolling average

Our example will run on a 1 TiB array with 64 years' worth of data and the following structure:

-![Python repr output of a 1 TiB Dask array with shape (1801, 3600, 23376), split into 5460 chunks of 250 MiB and shape (300, 300, 365)](/images/dask-detrending/input-array.png)
+<figure class="align-center">
+<img alt="Python repr output of a 1 TiB Dask array with shape (1801, 3600, 23376), split into 5460 chunks of 250 MiB and shape (300, 300, 365)" src="/images/dask-detrending/input-array.png" style="width: 600px;"/>
+</figure>

The array isn't overly huge and the chunks are reasonably sized.
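
For reference, an array with roughly this shape and chunking can be created directly in Dask. This is a sketch: the time length of 23376 (64 years of daily values) is inferred from the figure and the chunk count, so treat the exact numbers as approximate.

```python
import dask.array as da

# ~1.1 TiB of float64 data: a 1801 x 3600 grid with 64 years of daily values,
# split into ~250 MiB chunks of shape (300, 300, 365). Creation is lazy.
arr = da.random.random((1801, 3600, 23376), chunks=(300, 300, 365))

print(f"{arr.nbytes / 2**40:.2f} TiB")  # ~1.10 TiB
print(arr.npartitions)                  # 5460 chunks
```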

@@ -57,7 +59,9 @@ this operation unusable. Our array is sorted by time, which means that we have to pick
entries from many different areas in the array to create a single group (corresponding to a single day of the year).
Picking the same day of every year is basically a slicing operation with a step size of 365.
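
Conceptually, selecting one such group is just a strided slice along the time axis. A sketch, reusing the `arr` from above (the particular day is an arbitrary example and leap days are ignored):

```python
# All values for one day of the year: one entry per year,
# i.e. a slice with step 365 along the time axis.
doy = 41  # arbitrary example day of the year
one_group = arr[:, :, doy::365]
```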

-![Schematic showing an array sorted by time, where data is selected from many different areas in the array to create a single group (corresponding to a specific day of the year).](/images/dask-detrending/indexing-data-selection.png "Data Selection Pattern")
+<figure class="align-center">
+<img alt="Schematic showing an array sorted by time, where data is selected from many different areas in the array to create a single group (corresponding to a specific day of the year)." src="/images/dask-detrending/indexing-data-selection.png" title="Data Selection Pattern" style="width: 600px;"/>
+</figure>

Our example has a year's worth of data in a single chunk along the time axis. The general problem
exists for any workload where you have to access random entries of data. This
@@ -69,7 +73,9 @@ entry, e.g. each group will consist of as many chunks as we have years.

This results in a huge increase in the number of chunks:

-![Python repr output of a 1 TiB Dask array with nearly 2 million chunks of 700 kiB.](/images/dask-detrending/output-array-old.png)
+<figure class="align-center">
+<img alt="Python repr output of a 1 TiB Dask array with nearly 2 million chunks of 700 kiB." src="/images/dask-detrending/output-array-old.png" style="width: 600px;"/>
+</figure>

This simple operation increases the number of chunks from 5000 to close to 2 million. Each
chunk only has a few hundred kilobytes of data. **This is pretty bad!**
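
A rough back-of-the-envelope calculation, assuming each input chunk is split into one piece per group: ~5460 input chunks × 365 day-of-year groups ≈ 2 million output chunks, each a single 300 × 300 time slice of about 700 kB.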
@@ -94,7 +100,9 @@ it will try to preserve the input chunksize as closely as possible.
For our initial example, it will put every group into a single chunk. This means that the
number of chunks along the time axis ends up equal to the number of groups, i.e. 365.
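
One way to see the effect is to inspect the chunking of the result. A sketch, assuming `detrended` is the Dask-backed Xarray result of a groupby-map like the one sketched earlier:

```python
result = detrended.data    # the Dask array backing the Xarray result

print(result.numblocks)    # number of chunks along each dimension
print(result.chunksize)    # shape of the largest chunk
print(result.npartitions)  # total number of chunks
```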

-![Python repr output of a 1 TiB Dask array with 31164 chunks of 43 MiB](/images/dask-detrending/output-array-new.png)
+<figure class="align-center">
+<img alt="Python repr output of a 1 TiB Dask array with 31164 chunks of 43 MiB" src="/images/dask-detrending/output-array-new.png" style="width: 600px;"/>
+</figure>

The algorithm reduces the number of chunks from 2 million to roughly 30 thousand, which is a huge improvement
and a scale that Dask can easily handle. The graph is now much smaller, and the follow-up operations
