Introduce OrderedSets #834

Merged: 1 commit merged into master on May 20, 2024
Conversation

termi-official (Member) commented Nov 1, 2023

The title should say everything. This should increase cache locality for some problems (e.g. when using our generated grids), and we now have the guarantee that the iteration order for subdofhandlers and faces can be precisely controlled.

Closes #631.
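For illustration, a minimal standalone snippet of the property this relies on (plain OrderedCollections.jl, independent of Ferrite's grid types): an OrderedSet iterates in insertion order, whereas a Set iterates in a hash-dependent order.

using OrderedCollections

cellids = [42, 7, 19, 3]

collect(Set(cellids))         # hash-dependent order, e.g. [7, 3, 19, 42]
collect(OrderedSet(cellids))  # [42, 7, 19, 3], i.e. insertion order is preserved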

termi-official (Member Author)

Okay, even with this there is still quite a big performance hit when using the SubDofHandler.

using Ferrite, SparseArrays, BenchmarkTools
grid = generate_grid(Quadrilateral, (1000, 1000));

ip = Lagrange{RefQuadrilateral, 1}()
qr = QuadratureRule{RefQuadrilateral}(2)
cellvalues = CellValues(qr, ip);

dh = DofHandler(grid)
add!(dh, :u, ip)
close!(dh);

K = create_sparsity_pattern(dh)

ch = ConstraintHandler(dh);

∂Ω = union(
    getfaceset(grid, "left"),
    getfaceset(grid, "right"),
    getfaceset(grid, "top"),
    getfaceset(grid, "bottom"),
);

dbc = Dirichlet(:u, ∂Ω, (x, t) -> 0)
add!(ch, dbc);

close!(ch)

function assemble_element!(Ke::Matrix, fe::Vector, cellvalues::CellValues)
    n_basefuncs = getnbasefunctions(cellvalues)
    ## Reset to 0
    fill!(Ke, 0)
    fill!(fe, 0)
    ## Loop over quadrature points
    for q_point in 1:getnquadpoints(cellvalues)
        ## Get the quadrature weight
        dΩ = getdetJdV(cellvalues, q_point)
        ## Loop over test shape functions
        for i in 1:n_basefuncs
            δu  = shape_value(cellvalues, q_point, i)
            ∇δu = shape_gradient(cellvalues, q_point, i)
            ## Add contribution to fe
            fe[i] += δu * dΩ
            ## Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(cellvalues, q_point, j)
                ## Add contribution to Ke
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end

function assemble_global(cellvalues::CellValues, K::SparseMatrixCSC, dh::DofHandler)
    ## Allocate the element stiffness matrix and element force vector
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    ## Allocate global force vector f
    f = zeros(ndofs(dh))
    ## Create an assembler
    assembler = start_assemble(K, f)
    ## Loop over all cells
    for cell in CellIterator(dh)
        ## Reinitialize cellvalues for this cell
        reinit!(cellvalues, cell)
        ## Compute element contribution
        assemble_element!(Ke, fe, cellvalues)
        ## Assemble Ke and fe into K and f
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return nothing
end


function assemble_global2(cellvalues::CellValues, K::SparseMatrixCSC, dh::DofHandler)
    ## Allocate the element stiffness matrix and element force vector
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    ## Allocate global force vector f
    f = zeros(ndofs(dh))
    ## Create an assembler
    assembler = start_assemble(K, f)
    ## Loop over all cells
    for cell in CellIterator(dh.subdofhandlers[1])
        ## Reinitialize cellvalues for this cell
        reinit!(cellvalues, cell)
        ## Compute element contribution
        assemble_element!(Ke, fe, cellvalues)
        ## Assemble Ke and fe into K and f
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return nothing
end

@btime assemble_global(cellvalues, K, dh); # 494.943 ms (12 allocations: 7.65 MiB)
@btime assemble_global2(cellvalues, K, dh); # 655.668 ms (15 allocations: 7.65 MiB)

cc @kimauth

KnutAM (Member) commented Nov 1, 2023

Did a quick benchmark, and I think this check:

@assert i in sdh.cellset

is the reason for the worse performance. At least it would make sense to provide a fast unsafe path, IMO (or, as a quick solution for CellCache, to just pass sdh.dh instead). Theoretically, using the SubDofHandler could be faster by one lookup, because ndofs_per_cell is just one value...
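To illustrate why a per-cell membership check is costly (a standalone sketch, not Ferrite internals): the assert performs one hash lookup per visited cell, which plain iteration over the set avoids entirely.

using OrderedCollections, BenchmarkTools

cells = OrderedSet(1:1_000_000)

# One hash lookup per cell, roughly what the assert adds to every reinit!:
@btime all(i -> i in $cells, 1:1_000_000);

# Iterating the set directly needs no lookups:
@btime sum($cells);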


termi-official (Member Author)

Thanks for the pointer. If the performance issue is nothing conceptual, then I would leave the PR as it is (modulo updating the deps) and leave fixing it for another PR.

codecov-commenter commented Nov 1, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.86%. Comparing base (ccf583d) to head (bcae4a4).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #834      +/-   ##
==========================================
+ Coverage   93.84%   93.86%   +0.01%     
==========================================
  Files          36       36              
  Lines        5293     5310      +17     
==========================================
+ Hits         4967     4984      +17     
  Misses        326      326              


termi-official (Member Author) commented Nov 2, 2023

Did a quick benchmark, and I think this check:

@assert i in sdh.cellset

is the reason for the worse performance. At least it would make sense to provide a fast unsafe path, IMO (or, as a quick solution for CellCache, to just pass sdh.dh instead).

Confirmed locally that dropping this check on this branch brings the two timings close to each other:

julia> @btime assemble_global(cellvalues, K, dh); # 494.943 ms (12 allocations: 7.65 MiB)
  478.872 ms (12 allocations: 7.65 MiB)

julia> @btime assemble_global2(cellvalues, K, dh); # 655.668 ms (15 allocations: 7.65 MiB)
  499.271 ms (15 allocations: 7.65 MiB)

where the comments show the reference timings with the check included. Note that the timings on my machine are not super stable. Some of the remaining time difference can be attributed to iterating the OrderedSet, which increases memory pressure. Hence I think it is still worth investigating the use of ranges in the iterators. Maybe we can add a boolean to the SubDofHandler indicating whether its cellset forms a contiguous range? Or we can take another look at #625.
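A minimal sketch of the contiguous-range idea (the helper name and return convention are hypothetical, not part of this PR): detect whether the ordered cellset is a contiguous, increasing run of cell indices and hand out a UnitRange in that case.

using OrderedCollections

# Hypothetical helper: return a UnitRange if the ordered cellset is a
# contiguous increasing run of cell indices, otherwise the collected Vector.
function cellset_iterable(cellset::OrderedSet{Int})
    v = collect(cellset)
    if !isempty(v) && v == first(v):last(v)
        return first(v):last(v)  # contiguous: cheap to store and iterate
    end
    return v                     # general case: keep the explicit order
end

cellset_iterable(OrderedSet([3, 4, 5, 6]))  # 3:6
cellset_iterable(OrderedSet([3, 5, 4, 6]))  # [3, 5, 4, 6]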

Theoretically, using the SubDofHandler could be faster by one lookup, because ndofs_per_cell is just one value...

If that is the case, then we should do it and force the CellIterator to iterate over the SubDofHandler when there is only one subdofhandler.


termi-official added the "awaiting review" label (PR is finished from the author's POV, waiting for feedback) Nov 14, 2023
termi-official added this to the v1.0.0 milestone May 13, 2024
fredrikekre requested a review from KnutAM May 19, 2024
fredrikekre force-pushed the do/sdh-orderedsets branch 2 times, most recently from bee3d7b to fcb5dee, May 19, 2024
fredrikekre (Member)

Fixed up after #914.

fredrikekre removed the "awaiting review" label May 19, 2024
KnutAM (Member) left a review comment


It would be nice if we could make this such that changing the internal representation in the future won't be breaking, but I guess that is hard.

I'm just stuck with a feeling that this isn't the optimal solution, but I also don't have a better alternative right now.

(Resolved, now-outdated review comments on src/iterators.jl and src/Dofs/ConstraintHandler.jl.)
KnutAM (Member) commented May 19, 2024

It would probably be nice to automatically sort the sets when adding a regular set to the grid, but that can be done later.

fredrikekre (Member) commented May 19, 2024

@KnutAM something isn't right here with the porous media example (probably something with FerriteMeshParser?), can you have a look?

KristofferC (Collaborator)

FWIW, there have been quite a few improvements to the Base Dict that have not been carried over to OrderedDict. I am not sure this matters in practice but:

  • If someone is interested it might be worth trying to bring those improvements over to OrderedDict.
  • It is worth benchmarking again on newer Julia versions (1.11).

termi-official (Member Author) commented May 20, 2024

I'm just stuck with a feeling that this isn't the optimal solution, but I also don't have a better alternative right now.

I totally agree here on both points. I have tried different designs and none of them were really satisfactory to me.

Would probably be nice to automatically sort the sets when adding a regular set to the grid, but can be done later.

Why? Practically, this only makes sense if the underlying grid has some ordering, as in our generated ones, and if your subdomain also has to follow this order. In meshes generated e.g. by Gmsh, the optimal ordering is almost always not the ordering imposed by the element numbering, and you can hurt performance if you sort them.

* It is worth benchmarking again on newer Julia versions (1.11).

The Set or the OrderedSet version?

fredrikekre (Member)

I totally agree here on both points. I have tried different designs and none of them were really satisfactory to me.

What's wrong with this? We can relax the docs to say that getcellset etc. return a collection instead of explicitly an OrderedSet if you want the freedom to change it in the future.

Would probably be nice to automatically sort the sets when adding a regular set to the grid

See the comments in the code. addcellset! etc. loop over the cells in order, so the order of the added set will match. If you bring your own collection, then that order is preserved.
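A small example of the two paths (assuming the predicate and collection methods of addcellset! in this Ferrite version; the set names are made up): the predicate version visits cells in their natural grid order, while a user-provided collection keeps the order it was given in.

using Ferrite

grid = generate_grid(Quadrilateral, (4, 4))

# Predicate version: cells are checked in grid order, so the resulting
# set follows that order.
addcellset!(grid, "left_half", x -> x[1] ≤ 0.0)

# Bring-your-own collection: the given order is kept as-is.
addcellset!(grid, "custom", [7, 3, 11])

collect(getcellset(grid, "custom"))  # [7, 3, 11]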

KnutAM (Member) commented May 20, 2024

something isn't right here with the porous media example (probably something with FerriteMeshParser?), can you have a look?

Hopefully 14d8800 fixes it. I also tested locally that it works without the overload with Ferrite-FEM/FerriteMeshParser.jl#43. That is a breaking release for FerriteMeshParser.jl though, so we need to validate that no other breaking changes are required to support Ferrite v1 before merging and releasing. Once that's done, I'll PR removing that hack, and do the same for FerriteGmsh.

KnutAM (Member) commented May 20, 2024

We can relax the docs to say that getcellset etc. return a collection instead of explicitly an OrderedSet if you want the freedom to change it in the future.

I think it would make sense to say that it returns an ordered collection, as this PR shows that we want to keep that property.

It would probably be nice to automatically sort the sets when adding a regular set to the grid

See the comments in the code. addcellset! etc. loop over the cells in order, so the order of the added set will match. If you bring your own collection, then that order is preserved.

Right, that makes sense, nvm!

What's wrong with this?

I thought OrderedSet stored the information twice compared to a Set, but I seem to be wrong, so maybe this is the optimal approach (of course, a UnitRange is better when possible but not always applicable, and a Vector is smaller but has a slow in).

using OrderedCollections  # for OrderedSet

struct Index
    t::NTuple{2,Int}
end
cells = rand(1:10000, 1000);
facets = rand(1:4, 1000);
inds = [Index((c, f)) for (c, f) in zip(cells, facets)];

julia> Base.summarysize.((inds, Set(inds), OrderedSet(inds)))
(16040, 35008, 32320)

fredrikekre (Member)

If you need a vector we can expose the internal vector of the set.

This patch changes the use of `Set` to `OrderedSet` in the grid. This
means that e.g. loops over these sets follow the original specified
order. This is important for performance since it can reduce
cache-misses, for example. Fixes #631. Closes #654.

Co-authored-by: Fredrik Ekre <[email protected]>
Co-authored-by: Dennis Ogiermann <[email protected]>
fredrikekre merged commit 15db94b into master May 20, 2024
9 checks passed
fredrikekre deleted the do/sdh-orderedsets branch May 20, 2024 16:04
termi-official (Member Author)

IIRC, one open issue here is that we cannot directly parallelize over the ordered set. We want to add a how-to for this.
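A minimal sketch of what such a how-to might show (the per-cell work is a placeholder): since an OrderedSet is not indexable, collect it into a Vector first and thread over the indices.

using OrderedCollections

cellset = OrderedSet([2, 5, 7, 11])

# Collect into an indexable Vector, then parallelize over its indices;
# the original set order is preserved in `cells`.
cells = collect(cellset)
results = zeros(length(cells))
Threads.@threads for k in eachindex(cells)
    cellid = cells[k]
    results[k] = cellid^2  # placeholder for per-cell work
end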

Development

Successfully merging this pull request may close these issues.

Use OrderedSet for cellsets