
use FixedSizeDiffCache for flows #581

Merged
merged 2 commits into main from fixedsize, Sep 8, 2023

Conversation


@visr visr commented Sep 8, 2023

When profiling runs with the default `autodiff=true`, this line was responsible for 35% of the time and almost all allocations:

```julia
flow = get_tmp(connectivity.flow, u)
```

With this PR that drops down to 0%.

`connectivity.flow` is a sparse matrix, but `DiffCache` does not seem to handle sparse matrices well. The `dual_du` field was a dense vector of length n x n x cache_size, and the `get_tmp` call led to further allocations while trying to restructure the sparse matrix from that vector. Luckily there is `FixedSizeDiffCache`, which helps here: https://docs.sciml.ai/PreallocationTools/stable/#FixedSizeDiffCache
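As a rough sketch of the intended usage (the names here are illustrative, not the actual Ribasim code), assuming PreallocationTools and SparseArrays are available:

```julia
using PreallocationTools, SparseArrays

n = 4
flow = spzeros(n, n)      # stands in for connectivity.flow
flow[1, 2] = 1.0

# FixedSizeDiffCache keeps a buffer with the same structure as `flow`,
# instead of the dense dual_du vector a plain DiffCache would allocate.
cache = FixedSizeDiffCache(flow)

# With a plain Float64 state vector, get_tmp hands back the non-dual buffer;
# during autodiff (Dual-valued u) it returns a reinterpreted view instead.
u = zeros(n)
tmp = get_tmp(cache, u)
```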

This retains the sparsity in the dual buffer, and `get_tmp` returns a `ReinterpretArray` during autodiff. To avoid materializing this `ReinterpretArray`, I additionally needed to fill the parent array with zeros rather than the array itself.
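A minimal Base-only illustration of that last point:

```julia
# A ReinterpretArray wraps a parent array; fill!(parent(x), 0) zeroes the
# underlying storage directly instead of materializing the reinterpreted view.
raw = ones(UInt64, 3)
x = reinterpret(Float64, raw)    # x isa Base.ReinterpretArray, not a Vector
fill!(parent(x), zero(UInt64))   # zero the parent array
@assert all(iszero, x)           # the view now reads as all zeros
@assert parent(x) === raw        # no copy was made
```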

There is another, unrelated performance fix here: concretely typing the Parameter struct by adding type parameters for its fields. Otherwise you get situations like

```julia
struct A
    a::Vector
end
```

where the compiler doesn't know the element type of the Vector, so it can perform fewer optimizations. The solution:

```julia
struct A{T}
    a::Vector{T}
end
```
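A quick way to see the difference, sketched with hypothetical struct names:

```julia
struct Loose
    a::Vector              # abstract field type: element type unknown
end

struct Tight{T}
    a::Vector{T}           # concrete once T is filled in
end

# Vector on its own is a UnionAll, so the field type is not concrete;
# with the type parameter the field type is fully known to the compiler.
@assert !isconcretetype(fieldtype(Loose, :a))
@assert isconcretetype(fieldtype(Tight{Float64}, :a))
```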

Finally, I consistently added `AbstractVector`/`AbstractMatrix` argument type annotations to ensure the `ReinterpretArray` can pass through everywhere. And I renamed the functions that formulate flows to `formulate_flow`, to make it easier to distinguish them from the other `formulate!` methods.
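To illustrate the annotation point with a toy function (not the actual Ribasim signatures): a `ReinterpretArray` is an `AbstractVector`, so it passes straight through without conversion:

```julia
# Accept any AbstractVector so both Vector and ReinterpretArray work.
total(v::AbstractVector) = sum(v)

v = [1.0, 2.0]
rv = reinterpret(Float64, UInt64[0x3ff0000000000000])  # the bits of 1.0
@assert total(v) == 3.0
@assert total(rv) == 1.0
@assert rv isa AbstractVector && !(rv isa Vector)
```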

@visr visr requested a review from Hofer-Julian September 8, 2023 14:00
visr added a commit that referenced this pull request Sep 8, 2023
`Dictionary` uses `Indices{I}` as keys and `Vector{T}` as values. The Parameters contain both, so it was cheap to construct a `Dictionary` in a frequently called function like `get_level`. However, with autodiff the values could be a `ReinterpretArray` of `Dual`s instead of a plain `Vector`. This meant that on `Dictionary` creation it would convert the `ReinterpretArray` to a `Vector`, leading to many allocations.

This is on top of #581. After that, this was responsible for 94% of the time spent. With this PR that goes down to about 2%, leading to a nice little speedup.
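The conversion cost described above can be seen with Base alone (a sketch, not the Dictionaries.jl internals):

```julia
# Storing a ReinterpretArray in a container that insists on Vector{T}
# forces a full copy on construction.
rv = reinterpret(Float64, zeros(UInt64, 4))
v = Vector(rv)                   # allocates a fresh Vector{Float64}
@assert v isa Vector{Float64}
@assert !(rv isa Vector)
@assert v == rv
```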
@visr visr added the performance Relates to runtime performance or convergence label Sep 8, 2023
@Hofer-Julian Hofer-Julian left a comment


Impressive find! Also, thanks for the detailed explanation :D

@Hofer-Julian Hofer-Julian merged commit 8079781 into main Sep 8, 2023
18 checks passed
@Hofer-Julian Hofer-Julian deleted the fixedsize branch September 8, 2023 18:26