
use FixedSizeDiffCache for flows #581

Merged
merged 2 commits into main from fixedsize, Sep 8, 2023

Conversation


@visr visr commented Sep 8, 2023

When profiling runs with the default `autodiff=true`, this line was responsible for 35% of the time and almost all allocations:

```julia
flow = get_tmp(connectivity.flow, u)
```

With this PR that drops down to 0%.

`connectivity.flow` is a sparse matrix, but `DiffCache` does not seem to handle sparse matrices well. The `dual_du` field was a dense vector of length n x n x cache_size, and the `get_tmp` call led to further allocations while trying to restructure the sparse matrix from that vector. Luckily there is `FixedSizeDiffCache`, which helps here: https://docs.sciml.ai/PreallocationTools/stable/#FixedSizeDiffCache
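As a rough sketch of the intended usage (the names here are illustrative, not the actual Ribasim code), assuming PreallocationTools and SparseArrays are available:

```julia
using PreallocationTools, SparseArrays

n = 4
flow = spzeros(n, n)      # stands in for connectivity.flow
flow[1, 2] = 1.0

# FixedSizeDiffCache keeps a buffer with the same structure as `flow`,
# instead of the dense dual_du vector a plain DiffCache would allocate.
cache = FixedSizeDiffCache(flow)

# With a plain Float64 state vector, get_tmp hands back the non-dual buffer;
# during autodiff (Dual-valued u) it returns a reinterpreted view instead.
u = zeros(n)
tmp = get_tmp(cache, u)
```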

This retains the sparsity in the dual buffer, and `get_tmp` returns a `ReinterpretArray` during autodiff. To avoid materializing this `ReinterpretArray`, I additionally needed to fill the parent array with zeros rather than the array itself.
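A minimal Base-only illustration of that last point:

```julia
# A ReinterpretArray wraps a parent array; fill!(parent(x), 0) zeroes the
# underlying storage directly instead of materializing the reinterpreted view.
raw = ones(UInt64, 3)
x = reinterpret(Float64, raw)    # x isa Base.ReinterpretArray, not a Vector
fill!(parent(x), zero(UInt64))   # zero the parent array
@assert all(iszero, x)           # the view now reads as all zeros
@assert parent(x) === raw        # no copy was made
```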

There is another, unrelated performance fix here: concretely typing the Parameter struct by adding type parameters for its fields. Otherwise you get situations like

```julia
struct A
    a::Vector
end
```

where the compiler doesn't know the element type of the Vector, so it can perform fewer optimizations. The solution:

```julia
struct A{T}
    a::Vector{T}
end
```
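A quick way to see the difference, sketched with hypothetical struct names:

```julia
struct Loose
    a::Vector              # abstract field type: element type unknown
end

struct Tight{T}
    a::Vector{T}           # concrete once T is filled in
end

# Vector on its own is a UnionAll, so the field type is not concrete;
# with the type parameter the field type is fully known to the compiler.
@assert !isconcretetype(fieldtype(Loose, :a))
@assert isconcretetype(fieldtype(Tight{Float64}, :a))
```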

Finally, I consistently added `AbstractVector`/`AbstractMatrix` argument type annotations to ensure the `ReinterpretArray` can pass through everywhere. And I renamed the functions that formulate flows to `formulate_flow`, to make it easier to distinguish them from the other `formulate!` methods.
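To illustrate the annotation point with a toy function (not the actual Ribasim signatures): a `ReinterpretArray` is an `AbstractVector`, so it passes straight through without conversion:

```julia
# Accept any AbstractVector so both Vector and ReinterpretArray work.
total(v::AbstractVector) = sum(v)

v = [1.0, 2.0]
rv = reinterpret(Float64, UInt64[0x3ff0000000000000])  # the bits of 1.0
@assert total(v) == 3.0
@assert total(rv) == 1.0
@assert rv isa AbstractVector && !(rv isa Vector)
```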

@visr visr requested a review from Hofer-Julian September 8, 2023 14:00
visr added a commit that referenced this pull request Sep 8, 2023
`Dictionary` uses `Indices{I}` as keys and `Vector{T}` as values. The Parameters contain both, so it was cheap to construct a `Dictionary` in a frequently called function like `get_level`. However, with autodiff the values could be a `ReinterpretArray` of `Dual`s instead of a plain `Vector`. This meant that on `Dictionary` creation it would convert the `ReinterpretArray` to a `Vector`, leading to many allocations.

This is on top of #581. After that, this was responsible for 94% of the time spent. With this PR that goes down to about 2%, leading to a nice little speedup.
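The conversion cost described above can be seen with Base alone (a sketch, not the Dictionaries.jl internals):

```julia
# Storing a ReinterpretArray in a container that insists on Vector{T}
# forces a full copy on construction.
rv = reinterpret(Float64, zeros(UInt64, 4))
v = Vector(rv)                   # allocates a fresh Vector{Float64}
@assert v isa Vector{Float64}
@assert !(rv isa Vector)
@assert v == rv
```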
@visr visr added the performance Relates to runtime performance or convergence label Sep 8, 2023
@Hofer-Julian Hofer-Julian left a comment


Impressive find! Also, thanks for the detailed explanation :D

@Hofer-Julian Hofer-Julian merged commit 8079781 into main Sep 8, 2023
18 checks passed
@Hofer-Julian Hofer-Julian deleted the fixedsize branch September 8, 2023 18:26