Unify linear solvers #2706

Closed · wants to merge 9 commits into master from unify-linsolve
Conversation

@atgeirr (Member) commented Jul 10, 2020

This tries to reduce the linear solver code paths to essentially two: using the GPU solver or the FlexibleSolver.

@atgeirr force-pushed the unify-linsolve branch 2 times, most recently from 1b4cb20 to 85de88a on July 31, 2020 14:25
@atgeirr marked this pull request as ready for review on August 4, 2020 10:34
@atgeirr (Member Author) commented Aug 4, 2020

This is now ready for review. I have tried to eliminate a lot of complexity (and a decent bit of compilation time), and to that end I have made two perhaps controversial choices:

  • Made the FlexibleSolver the default and only alternative to the GPU solver. There is no change in results with the default (ilu0) option, as already established.
  • Require "--owner-cells-first=true" (this is already the default). However, the possibility of having a general cell order may be good to have in the future (adaptive grids, perhaps?), so I think it is worth discussing.

I also had to re-add the makeOverlapRowsInvalid() method and run it for all parallel runs. I was a little confused about this: I assumed it would not be necessary, at least with ILU0 and --owner-cells-first=true, but apparently I was wrong. I did not quite understand how (and if) this modification was done in the master branch version. A sketch of what the method does follows below.
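Since this method comes up repeatedly below, here is a minimal sketch of what makeOverlapRowsInvalid() amounts to, assuming a Dune::BCRSMatrix of dense blocks and a precomputed list of overlap-row indices (the signature and names are illustrative, not the PR's exact code):

    #include <dune/istl/bcrsmatrix.hh>
    #include <cstddef>
    #include <vector>

    // Sketch: turn each overlap (ghost) row into an identity row, so that
    // preconditioners such as ILU0 treat ghost unknowns as trivial equations.
    template <class Matrix>
    void makeOverlapRowsInvalid(Matrix& matrix,
                                const std::vector<std::size_t>& overlapRows)
    {
        for (const auto row : overlapRows) {
            for (auto col = matrix[row].begin(); col != matrix[row].end(); ++col) {
                *col = 0.0;                               // zero the whole block
                if (col.index() == row) {
                    for (int i = 0; i < Matrix::block_type::rows; ++i) {
                        (*col)[i][i] = 1.0;               // unit diagonal block
                    }
                }
            }
        }
    }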

There is also quite a bit of refactoring here, so the diffs may not be all that useful for reviewing. If requested, I can try to split this into several PRs to simplify the process.

@atgeirr (Member Author) commented Aug 4, 2020

jenkins build this please

@blattms (Member) commented Aug 4, 2020 via email

Please post benchmark results before and after the changes.

@atgeirr (Member Author) commented Aug 4, 2020

> Please post benchmark results before and after the changes.

Sure! Did you mean compile times or simulation times? Or both?

@atgeirr (Member Author) commented Aug 4, 2020

benchmark please

@atgeirr (Member Author) commented Aug 4, 2020

Build times (16 HW threads, building with make -j 15, not including cmake times) for release mode:

Build variant                                          Master   This PR
Clean build                                            4m24s    3m59s
touch opm/simulators/wells/StandardWell.hpp            3m34s    3m02s
touch opm/simulators/linalg/PreconditionerFactory.hpp  1m21s    1m20s

Build times (16 HW threads, building with make -j 15, not including cmake times) for debug mode:

Build variant                                          Master   This PR
Clean build                                            2m05s    1m53s
touch opm/simulators/wells/StandardWell.hpp            1m29s    1m21s
touch opm/simulators/linalg/PreconditionerFactory.hpp  41s      41s

@atgeirr (Member Author) commented Aug 4, 2020

Results from running Norne with default settings and 8 processes. Master branch:

Number of MPI processes:      8
Threads per MPI process:      1
Total time (seconds):         210.217
Solver time (seconds):        210.165
 Assembly time (seconds):     62.3946 (Failed: 1.35474; 2.17125%)
 Linear solve time (seconds): 117.127 (Failed: 2.32797; 1.98756%)
 Linear solve setup time (seconds): 17.5982 (Failed: 0.478168; 2.71714%)
 Update time (seconds):       3.31908 (Failed: 0.0760609; 2.29163%)
 Output write time (seconds): 2.18266
Overall Well Iterations:      928 (Failed: 5; 0.538793%)
Overall Linearizations:       1953 (Failed: 42; 2.15054%)
Overall Newton Iterations:    1618 (Failed: 42; 2.5958%)
Overall Linear Iterations:    25444 (Failed: 463; 1.81968%)

This PR:

Number of MPI processes:      8
Threads per MPI process:      1
Total time (seconds):         208.417
Solver time (seconds):        208.366
 Assembly time (seconds):     62.2715 (Failed: 1.34684; 2.16285%)
 Linear solve time (seconds): 115.238 (Failed: 2.20138; 1.91029%)
 Linear solve setup time (seconds): 17.1001 (Failed: 0.434218; 2.53928%)
 Update time (seconds):       3.29661 (Failed: 0.0742074; 2.25102%)
 Output write time (seconds): 2.17215
Overall Well Iterations:      928 (Failed: 5; 0.538793%)
Overall Linearizations:       1953 (Failed: 42; 2.15054%)
Overall Newton Iterations:    1618 (Failed: 42; 2.5958%)
Overall Linear Iterations:    25444 (Failed: 463; 1.81968%)

Results are identical. I think the time differences are just noise, and do not indicate any advantage for the PR (it should not have any).

@alfbr (Member) commented Aug 4, 2020

Seems we still have issues with the benchmarks; hopefully they will be resolved tomorrow.

@blattms (Member) left a comment:

It seems to me like all the parallel optimizations done by @andrthu are lost with this. Unless I am mistaken, these actually gave quite a performance boost on parallel production runs, hence his changes must be retained. E.g., the default should be ownersFirst_==true, which is what it is in master, but in this branch it is the opposite.

It seems like easily choosing CPR or AMG with one option is gone now; I think this is a decrease in usability. Maybe "cpr" (defaulting to true IMPES) and "amg" should be added to --linear-solver-configuration?

#endif
}
else
{
std::string conf = p.linear_solver_configuration_;
// Support old UseCpr if no configuration was set
if (!EWOMS_PARAM_IS_SET(TypeTag, std::string, LinearSolverConfiguration) && p.use_cpr_)

Member:
How does a user select CPR now (formerly --use-cpr=true was all it took)?


Member Author:

I am trying to make everything follow --linear-solver-configuration (or at least move in that direction).

@@ -191,14 +194,14 @@ namespace Opm
newton_use_gmres_ = EWOMS_GET_PARAM(TypeTag, bool, UseGmres);
require_full_sparsity_pattern_ = EWOMS_GET_PARAM(TypeTag, bool, LinearSolverRequireFullSparsityPattern);
ignoreConvergenceFailure_ = EWOMS_GET_PARAM(TypeTag, bool, LinearSolverIgnoreConvergenceFailure);
linear_solver_use_amg_ = EWOMS_GET_PARAM(TypeTag, bool, UseAmg);

Member:

How do I select AMG as the preconditioner (formerly --use-amg)? I think that proved good for some two-phase problems of @totto82.


Member Author (@atgeirr), Aug 6, 2020:

Right now, by using the JSON file option. I was not aware this actually had good use cases; I suggest we make any such particular choices available via --linear-solver-configuration=some_string, just like for "ilu0", "cpr_quasiimpes" and "cpr_trueimpes".

Edit: I see this is the same as what you suggested.
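For concreteness, here is a sketch of how such string presets could be dispatched to FlexibleSolver settings via a boost::property_tree; the function name and the property keys are assumptions for illustration, not the PR's actual code:

    #include <boost/property_tree/json_parser.hpp>
    #include <boost/property_tree/ptree.hpp>
    #include <string>

    // Sketch: map --linear-solver-configuration values to solver settings;
    // anything unrecognized is treated as the name of a JSON configuration file.
    boost::property_tree::ptree setupSolverConfiguration(const std::string& conf)
    {
        boost::property_tree::ptree prm;
        if (conf == "ilu0") {
            prm.put("solver", "bicgstab");
            prm.put("preconditioner.type", "ILU0");
        } else if (conf == "cpr_quasiimpes" || conf == "cpr_trueimpes") {
            prm.put("solver", "bicgstab");
            prm.put("preconditioner.type", "cpr");
            prm.put("preconditioner.weight_type",
                    conf == "cpr_trueimpes" ? "trueimpes" : "quasiimpes");
        } else {
            // Fall back to a user-supplied JSON file (the existing option).
            boost::property_tree::read_json(conf, prm);
        }
        return prm;
    }

Adding a plain "cpr" value could then simply alias one of the presets, which would address the typing concern raised below.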


Member:

Seems like this is still missing?

If we skip --use-cpr like in this PR, then at the very least I would suggest having "cpr" here. But "--linear-solver-configuration=cpr" is still a lot more to type than "--use-cpr".

Comment on lines +135 to +137
// cpr_ilu_milu_ = MILU_VARIANT::ILU;
// cpr_ilu_redblack_ = false;
// cpr_ilu_reorder_sphere_ = true;

Member:

Are all these really supported by the flexible solvers?

Comment on lines -196 to -197
noGhostAdjacency();
setGhostsInNoGhost(*noGhostMat_);

Member:

I think there was a performance reason in parallel for doing this. Are you sure it does not apply anymore? Please back this up with some numbers.

Are we perhaps now always using the OwnerCellsFirst approach? In that case we need to remove the option from opm-grid, too.


Member Author:

This is indeed trying to always use OwnerCellsFirst; however, I am unsure if I should reverse that, as discussed briefly in the description.

OpmLog::warning("OwnerCellsFirst option is true, but ignored.");
const bool ownersFirst = EWOMS_GET_PARAM(TypeTag, bool, OwnerCellsFirst);
if (!ownersFirst) {
const std::string msg = "The linear solver no longer supports --owner-cells-first=false.";

Member:

Sorry, this is a no-go for performance reasons. I might have said before that ownersFirst==true should be the only option in the long term.


Member Author:

This confuses me: with this PR ownersFirst==true is the only option, unless I made some 180 degree mistake somewhere.


Member:

Indeed, I misread this.
But if we only support true, shouldn't the option be deleted? Having it confuses me.


Member Author:

I did not go as far as removing the option, as I became unsure if that would be acceptable. But if there are no uses for it (adaptivity struck me as a possible use) I also think it is better to remove it.

boost::property_tree::write_json(os, prm_, true);
OpmLog::note(os.str());
}
interiorCellNum_ = detail::numMatrixRowsToUseInSolver(simulator_.vanguard().grid(), true);

Member:

This is now actually doing grid.numCells()


Member Author (@atgeirr), Aug 6, 2020:

I do not think so. This is the function called, with the second argument being true:

    template <class Grid>
    size_t numMatrixRowsToUseInSolver(const Grid& grid, bool ownerFirst)
    {
        size_t numInterior = 0;
        if (!ownerFirst || grid.comm().size()==1)
            return grid.numCells();
        const auto& gridView = grid.leafGridView();
        auto elemIt = gridView.template begin<0>();
        const auto& elemEndIt = gridView.template end<0>();

        // loop over cells in mesh
        for (; elemIt != elemEndIt; ++elemIt) {

            // Count only the interior cells.
            if (elemIt->partitionType() == Dune::InteriorEntity) {
                numInterior++;
            }
        }

        return numInterior;
    }

Looks fine to me? It could be simplified if we go with this, though, as we would never call it with a false argument.
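For what it is worth, here is a sketch of the simplification mentioned above, assuming ownerFirst is always true so the flag argument disappears (illustrative only, not the PR's code):

    #include <dune/grid/common/gridenums.hh>
    #include <dune/grid/common/rangegenerators.hh>
    #include <cstddef>

    // Sketch: with owners-first ordering mandatory, only interior cells are
    // counted in parallel runs; sequential runs use all cells.
    template <class Grid>
    std::size_t numMatrixRowsToUseInSolver(const Grid& grid)
    {
        if (grid.comm().size() == 1) {
            return grid.numCells();
        }
        std::size_t numInterior = 0;
        for (const auto& elem : elements(grid.leafGridView())) {
            if (elem.partitionType() == Dune::InteriorEntity) {
                ++numInterior;
            }
        }
        return numInterior;
    }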

@@ -678,87 +331,34 @@ DenseMatrix transposeDenseMatrix(const DenseMatrix& M)

void prepareFlexibleSolver()
{
// Decide if we should recreate the solver or just do

Member:

Since the scaling stuff (DRS et al.) is from your colleagues, I trust you here. I was wondering myself whether this is used/useful.

Comment on lines -749 to -750
OPM_THROW(std::runtime_error, "In parallel, the flexible solver requires "
"--owner-cells-first=true when --matrix-add-well-contributions=false is used.");

Member:

Why isn't this needed anymore? ownersFirst is now always false with your changes.


Member Author:

See above comment, I meant for it to be true always.

@blattms (Member) commented Aug 6, 2020

According to #2220 the savings (besides noise) started from 16 processes.

Are we still doing #2205?

@atgeirr (Member Author) commented Aug 6, 2020

> Are we still doing #2205?

I have to admit I was confused by this code, and may have deleted too much. I assumed that the no-ghost matrix was only created and used when the ordering was not OwnerCellsFirst, so it would no longer be needed, but I may have misread. So will the manipulations made with the no-ghost matrix still have any effect with OwnerCellsFirst always true?

@blattms (Member) commented Aug 6, 2020

If ownerFirst is always true, then the additions of #2205 can go, and in addition both findOverlapAndInterior and makeOverlapRowsInvalid should not be needed anymore. I am not 100% sure, so we should test that. Overlap rows are not treated by the operator anymore; I am not sure about the preconditioner.

Concerning that: the ILU had knowledge of whether this was ownerFirst_ or not and adapted accordingly. Is that still true for the flexible solvers?

@atgeirr (Member Author) commented Aug 6, 2020

> makeOverlapRowsInvalid should not be needed anymore. I am not 100% sure, so we should test that. Overlap rows are not treated by the operator anymore; I am not sure about the preconditioner.

I found that I needed this to get identical results to the current master. I think it makes sense: even if the operator does not see the rows, the preconditioner will.

> Concerning that: the ILU had knowledge of whether this was ownerFirst_ or not and adapted accordingly. Is that still true for the flexible solvers?

Yes. Although it is not elegant: I added a method to the PreconditionerFactory

    template <class CommArg>
    static size_t interiorIfGhostLast(const CommArg& comm)

that returns the number of interior cells or all cells, which is just what the constructor of the ParallelOverlappingILU0 class needs. So this is (re-)calculated every time we (re-)construct the preconditioner. It did not seem to cause any performance penalty that I could discover (it is a local operation, no communication). We could add another argument to PreconditionerFactory::create(), just like the weight function, but I am thinking about ways to communicate such information to the preconditioner construction in a more generic way. One way could be to pass, for example, a const std::map<std::string, std::any>& extra_info through to the preconditioner; then CPR could look for extra_info["weightfunction"], the parallel ILU0 could look for extra_info["interior_size"], etc. If something needed is not found, we can give a helpful error message.
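A minimal sketch of that idea, assuming C++17 (the helper name and map keys are illustrative only):

    #include <any>
    #include <map>
    #include <stdexcept>
    #include <string>

    using ExtraInfo = std::map<std::string, std::any>;

    // Sketch: typed lookup with a helpful error message when a preconditioner
    // needs a piece of information that the caller did not provide.
    template <class T>
    T getExtraInfo(const ExtraInfo& info, const std::string& key)
    {
        const auto it = info.find(key);
        if (it == info.end()) {
            throw std::invalid_argument("Preconditioner setup requires missing entry: " + key);
        }
        return std::any_cast<T>(it->second);
    }

    // Usage (illustrative): CPR could call
    //   getExtraInfo<WeightFunction>(extra_info, "weightfunction");
    // and the parallel ILU0
    //   getExtraInfo<std::size_t>(extra_info, "interior_size");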

@atgeirr (Member Author) commented Aug 11, 2020

benchmark please

@alfbr (Member) commented Aug 12, 2020

There were some issues related to MariaDB and the GitHub API. Hopefully they are resolved now, so let's try again.

@alfbr (Member) commented Aug 12, 2020

benchmark please

@alfbr (Member) commented Aug 12, 2020

There seems to be a build failure for this PR preventing the benchmark.

@atgeirr (Member Author) commented Aug 12, 2020

> There seems to be a build failure for this PR preventing the benchmark.

A bit strange, since the Jenkins test passed, but perhaps things have changed on master in incompatible ways. I will rerun Jenkins to see if that gives a build error.

@atgeirr (Member Author) commented Aug 12, 2020

jenkins build this serial please

@blattms (Member) commented Aug 12, 2020

In file included from /home/mblatt/src/dune/opm-2.6/opm-simulators/ebos/ebos.hh:35:0,
                 from /home/mblatt/src/dune/opm-2.6/opm-simulators/ebos/ebos_blackoil.cc:30:
/home/mblatt/src/dune/opm-2.6/opm-simulators/opm/simulators/linalg/ISTLSolverEbos.hpp: In member function ‘bool Opm::ISTLSolverEbos<TypeTag>::solve(Opm::ISTLSolverEbos
<TypeTag>::Vector&)’:
/home/mblatt/src/dune/opm-2.6/opm-simulators/opm/simulators/linalg/ISTLSolverEbos.hpp:258:36: error: ‘simulator’ was not declared in this scope
                     if (use_gpu && simulator.gridView().comm().rank() == 0) {
                                    ^~~~~~~~~
/home/mblatt/src/dune/opm-2.6/opm-simulators/opm/simulators/linalg/ISTLSolverEbos.hpp:258:36: note: suggested alternative: ‘simulator_’
                     if (use_gpu && simulator.gridView().comm().rank() == 0) {
                                    ^~~~~~~~~
                                    simulator_

@atgeirr (Member Author) commented Aug 12, 2020

Thanks for the error message, a trivial bug for sure! (But hard to find without GPU compilation.) I'll update.

@atgeirr (Member Author) commented Aug 12, 2020

Is it not strange that Jenkins did not complain? I assumed that it also compiled the GPU version?

@atgeirr (Member Author) commented Aug 12, 2020

jenkins build this please

@atgeirr (Member Author) commented Aug 12, 2020

benchmark please

@atgeirr (Member Author) commented Aug 13, 2020

Seems that the benchmark did not go through? Could someone with a CUDA or OpenCL setup test this PR in case I have more silly mistakes in the code parts I do not compile locally?

@alfbr (Member) commented Aug 13, 2020

> Seems that the benchmark did not go through? Could someone with a CUDA or OpenCL setup test this PR in case I have more silly mistakes in the code parts I do not compile locally?

Actually it did go through, it was just not reported. During testing the reporting was turned off to avoid spamming. Sorry for the confusion. These were the results:

Test     Configuration                                Relative
opm-git  OPM Benchmark: flow_mpi_extra - Threads: 1   1.03
opm-git  OPM Benchmark: flow_mpi_extra - Threads: 8   0.984
opm-git  OPM Benchmark: flow_mpi_norne - Threads: 1   1.002
opm-git  OPM Benchmark: flow_mpi_norne - Threads: 8   0.978

Speed-up = Total time master / Total time pull request. Above 1.0 is an improvement.

@atgeirr (Member Author) commented Aug 13, 2020

Thanks for the results! The slight performance reduction for 8 threads could perhaps be noise, but it might not be. After discussion with @andrthu I think I have a better understanding of what is and is not removed here: results should be identical (as also indicated by testing), but there is one performance optimization lost: the "noGhostMatrix" would be built with no off-diagonal entries on ghost rows. However, this optimization would only affect the (no longer default) OwnerCellsFirst==false case. After discussion, the conclusion was that it is better to never put these entries in the Jacobian in the first place, instead of manipulating the matrix afterwards. That should save the time for makeOverlapRowsInvalid() and also save a tiny bit of time for assembly. I'll experiment a little.
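To make the idea concrete, here is a sketch of building a sparsity pattern in which non-interior (ghost) rows receive only a diagonal entry, so nothing has to be erased afterwards; the names are illustrative assumptions, not the actual opm-models change:

    #include <dune/grid/common/gridenums.hh>
    #include <dune/grid/common/rangegenerators.hh>
    #include <cstddef>
    #include <set>
    #include <vector>

    // Sketch: ghost rows get a diagonal entry only; interior rows also get
    // their neighbor couplings.
    template <class GridView, class Mapper>
    std::vector<std::set<std::size_t>> buildSparsityPattern(const GridView& gridView,
                                                            const Mapper& mapper)
    {
        std::vector<std::set<std::size_t>> pattern(gridView.size(/*codim=*/0));
        for (const auto& elem : elements(gridView)) {
            const auto row = mapper.index(elem);
            pattern[row].insert(row);             // diagonal entry, always
            if (elem.partitionType() != Dune::InteriorEntity) {
                continue;                         // ghost row: diagonal only
            }
            for (const auto& is : intersections(gridView, elem)) {
                if (is.neighbor()) {
                    pattern[row].insert(mapper.index(is.outside()));
                }
            }
        }
        return pattern;
    }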

@alfbr (Member) commented Aug 13, 2020

> The slight performance reduction for 8 threads could perhaps be noise, but it might not be.

Indeed, the deviation is within the noise band. Then again, it may be real, so please do the experiments.

@blattms (Member) commented Aug 13, 2020 via email

I think I said this before: There is no reason for OwnerCellsFirst and if possible the option should be removed and the code cleaned accordingly.

@alfbr (Member) commented Aug 13, 2020

> I think I said this before: There is no reason for OwnerCellsFirst and if possible the option should be removed and the code cleaned accordingly.

Good point, this is maybe a good opportunity to clean it up.

@atgeirr (Member Author) commented Aug 13, 2020

> Good point, this is maybe a good opportunity to clean it up.

Will amend. I think we should keep the possibility in opm-grid, but remove the option from Flow. Perhaps also add some assertions to guard against future mistakes.

@bska (Member) commented Aug 13, 2020

> I think we should keep the possibility in opm-grid

What is the cost/benefit ratio of doing so? I can easily imagine that keeping the possibility in place will incur at least some maintenance cost (e.g., one more branch that must be tested whenever we touch the code), and I don't expect the benefit to be particularly large if we never run that code.

@blattms (Member) commented Aug 13, 2020 via email

@ytelses commented Aug 13, 2020

Benchmark result overview:

Test     Configuration                                Relative
opm-git  OPM Benchmark: flow_mpi_extra - Threads: 1   1.03
opm-git  OPM Benchmark: flow_mpi_extra - Threads: 8   0.984
opm-git  OPM Benchmark: flow_mpi_norne - Threads: 1   1.002
opm-git  OPM Benchmark: flow_mpi_norne - Threads: 8   0.978

Speed-up = Total time master / Total time pull request. Above 1.0 is an improvement.

View result details @ https://www.ytelses.com/opm/?page=result&id=

@blattms (Member) commented Aug 14, 2020

Unfortunately, this PR seems to be 5% slower on model 2 with 16 processors. I will run further tests.

@atgeirr (Member Author) commented Aug 14, 2020

I have now done some experiments with eliminating off-diagonal nonzeros on ghost rows, and it worked quite well! I see a minor speedup, and it should allow this PR to be merged without fear of losing any optimization. See OPM/opm-models#623.

There is one detail worth mentioning: when using --matrix-add-well-contributions=true (still recommended to get good CPR performance), the well contributions can still result in ghost-row off-diagonal elements. I looked briefly at eliminating this as well, but it would be a little trickier than the opm-models change, so I postponed that.

> Unfortunately, this PR seems to be 5% slower on model 2 with 16 processors. I will run further tests.

I assume that is with the default solver and options? Consider testing with the above-mentioned opm-models PR.

@blattms (Member) commented Aug 14, 2020 via email

IMHO all wells should only perforate interior cells, as each well is contained in the interior.

@atgeirr (Member Author) commented Aug 14, 2020

> IMHO all wells should only perforate interior cells, as each well is contained in the interior.

That sounds good, my mistake then. However, my testing seemed to indicate that with CPR the makeOverlapRowsInvalid() call still made a difference. I interpreted that to mean it was needed due to off-diagonal ghost elements from the wells. However, if those do not exist, the only effect of that call is setting the diagonal element to identity. So perhaps that is still a good idea!

@blattms (Member) commented Aug 17, 2020

I retract my statement about the 5%. I did more runs, and on average the runtimes are the same.

@@ -309,7 +309,9 @@ public:
auto wellDofIt = dofVariables_.begin();
const auto& wellDofEndIt = dofVariables_.end();
for (; wellDofIt != wellDofEndIt; ++ wellDofIt) {
neighbors[wellGlobalDof].insert(wellDofIt->first);
if (wellDofIt->second.element.partitionType() == Dune::PartitionType::InteriorEntity) {

Member:

Maybe this will be gone after rebasing. Is it needed for this PR?

@atgeirr (Member Author) commented Oct 10, 2020

Closing in favour of #2848.

@atgeirr closed this on Oct 10, 2020
@atgeirr deleted the unify-linsolve branch on October 15, 2020 08:08