Refine documentation #139

Draft · wants to merge 10 commits into base: main
25 changes: 14 additions & 11 deletions docs/advanced_topics.md
@@ -9,6 +9,7 @@
- J. Garcke. ‘Sparse Grids in a Nutshell’. In: Sparse Grids and Applications.
Ed. by J. Garcke, M. Griebel. Lecture Notes in Computational Science and Engineering.
Berlin, Heidelberg: Springer, 2013, pp. 57–80.
url: <https://ins.uni-bonn.de/media/public/publication-media/sparse_grids_nutshell_code.pdf>
- D. Pflüger. Spatially Adaptive Sparse Grids for High-Dimensional Problems.
Verlag Dr. Hut, 2010.
- W. Guo, Y. Cheng. ‘A Sparse Grid Discontinuous Galerkin Method for High-Dimensional
@@ -29,23 +30,19 @@
of sparse grid problems’. In: Proceedings of the IMACS International Symposium
on Iterative Methods in Linear Algebra: Brussels, Belgium, 2 - 4 April, 1991.
Ed. by P. de Groen, R. Beauwens. North Holland, 1992.
url: <https://ins.uni-bonn.de/media/public/publication-media/griesiam.ps.gz>
- M. Heene. ‘A Massively Parallel Combination Technique for the Solution of
High-Dimensional PDEs’. PhD thesis.
Institut für Parallele und Verteilte Systeme der Universität Stuttgart, 2018.
url: <http://dx.doi.org/10.18419/opus-9893>

## Variants of the Combination Technique in DisCoTec

### Truncated Combination Technique
### Adaptivity

DisCoTec by default uses the *truncated combination technique*, which sets a
minimum level for all component grids used.

- J. Benk, D. Pflüger. ‘Hybrid parallel solutions of the Black-Scholes PDE with
the truncated combination technique’. In: 2012 International Conference on
High Performance Computing Simulation (HPCS). July 2012.

### Adaptivity

Of the two ways of bringing adaptivity into the combination technique, DisCoTec
currently supports only one:
Static dimensional adaptivity can be achieved by providing the combination
@@ -58,6 +55,11 @@ If your application needs dynamic (= during runtime) or spatial adaptivity

- T. Gerstner, M. Griebel. ‘Dimension–adaptive tensor–product quadrature’.
In: Computing 71.1 (2003), pp. 65–87.
url: <https://ins.uni-bonn.de/media/public/publication-media/dimadapt.pdf>
- J. Benk, D. Pflüger. ‘Hybrid parallel solutions of the Black-Scholes PDE with
the truncated combination technique’. In: 2012 International Conference on
High Performance Computing Simulation (HPCS). July 2012.
url: <https://ieeexplore.ieee.org/abstract/document/6266992>
- C. Kowitz. ‘Applying the Sparse Grid Combination Technique in Linear Gyrokinetics’.
Dissertation. München: Technische Universität München, 2016.
- M. Obersteiner. ‘A spatially adaptive and massively parallel implementation of
@@ -149,9 +151,10 @@ The last reference below shows how conservation of mass and L2 stability is only
provided by the latter two.
In practice, we have observed that using hierarchical hat functions and long
combination intervals (many time steps per combination) is fine with relatively
laminar PDE solutions.
But in the turbulent regime, it becomes necessary to use the CDF wavelets and to
combine after every PDE solver time step to avoid numerical instability.
laminar PDE solutions
(both in fluid dynamics and plasma turbulence simulations).
However, in the turbulent regime, it becomes necessary to use the CDF wavelets
and to combine after every PDE solver time step to avoid numerical instability.
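
To make the role of the multiscale transform more concrete, here is a minimal
1-D sketch of hierarchization with hierarchical hat functions (plain C++, not
DisCoTec's `DistributedHierarchization` implementation, and assuming a
non-periodic grid with $2^l + 1$ points including both boundary points):

```cpp
#include <vector>

// In-place 1-D hierarchization with hat functions: the hierarchical surplus at
// each point newly introduced on a level is its nodal value minus the linear
// interpolation (mean) of its two hierarchical neighbors.
void hierarchizeHat1D(std::vector<double>& u) {
  const int n = static_cast<int>(u.size());  // assumed to be 2^l + 1
  for (int stride = 1; 2 * stride < n; stride *= 2) {
    for (int i = stride; i < n - 1; i += 2 * stride) {
      u[i] -= 0.5 * (u[i - stride] + u[i + stride]);
    }
  }
}
```

Roughly speaking, the CDF wavelets extend this predict step by additional
update steps with wider stencils, which is what provides the conservation and
stability properties referenced above.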

If you find yourself in need of higher orders of accuracy or conservation, you
could add higher-order CDF wavelets to `DistributedHierarchization.hpp`.
@@ -245,7 +248,7 @@ with LZ4.

### Using PDE Solvers Written In Other Programming Languages

Your functions need the same described interface and need to somehow expose it
Your functions need the same described interface and need to expose it
to the C++ compiler.
For Fortran, the SelalibTask shows how the [relevant functions](https://github.com/selalib/selalib/blob/main/simulations/parallel/bsl_vp_3d3v_cart_dd/sll_m_sim_bsl_vp_3d3v_cart_dd_slim_interface.F90)
can be [made accessible with `extern "C"`](https://github.com/SGpp/DisCoTec/blob/main/examples/selalib_distributed/src/SelalibTask.hpp).
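
As a minimal sketch of that pattern (with a hypothetical routine name, not the
actual SeLaLib interface), the C++ side only needs an `extern "C"` declaration
matching a routine that the Fortran side exports with `bind(C)`:

```cpp
// Hypothetical example, not part of DisCoTec or SeLaLib:
// the Fortran solver would export, e.g.,
//   subroutine my_solver_advance(dt) bind(C, name="my_solver_advance")
// and the C++ side declares and calls it like a C function.
extern "C" {
void my_solver_advance(double dt);
}

// ... later, e.g. inside your Task's run() method:
// my_solver_advance(dt_);
```
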
49 changes: 47 additions & 2 deletions docs/combination_technique.md
@@ -16,13 +16,58 @@ Taken together, all component grids form a sparse grid approximation, which can
be explicitly obtained by a linear superposition of the individual grid
functions, with the so-called combination coefficients.

The component grids
can be identified with their level vector $\vec{\ell} \in \mathbb{N}^d$ whose length
equals the dimensionality $d$ of the problem.
From it, one can derive the
grid's spacing $\vec{h}$ per dimension as $h_i = 2^{-\ell_i}$,
assuming the domain is $[0,1)^d$.

![schematic of a combination scheme in 2D](../gfx/combischeme-2d.svg)

In this two-dimensional combination scheme, all combination coefficients are
either $1$ or $-1$.
Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/handle/11682/14229),
adapted from Pollinger et al. [2023](https://dl.acm.org/doi/10.1145/3581784.3607036)).

For instance, an $x$-direction level of 5 denotes a grid spacing in the
$x$ dimension of $h_x = \frac{1}{32}$. Assuming periodicity, this means we use 32
points in dimension $x$, as illustrated by the rightmost component grid in the
above figure.
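
As a small sketch in plain C++ (not the DisCoTec API), the points and mesh
widths per dimension follow directly from the level vector, under the same
assumption of a periodic domain $[0,1)^d$:

```cpp
#include <cstddef>
#include <vector>

// Points and mesh width per dimension for a component grid with level vector l.
struct GridGeometry {
  std::vector<std::size_t> numPoints;  // 2^{l_d} points in dimension d
  std::vector<double> spacing;         // h_d = 2^{-l_d}
};

GridGeometry geometryFromLevelVector(const std::vector<int>& l) {
  GridGeometry geometry;
  for (const int level : l) {
    const std::size_t points = std::size_t{1} << level;  // 2^level
    geometry.numPoints.push_back(points);
    geometry.spacing.push_back(1.0 / static_cast<double>(points));
  }
  return geometry;
}

// Example: l = {5, 3} gives a 32 x 8 grid with spacings 1/32 and 1/8.
```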

Here, we use the truncated combination technique, which sets a minimum and maximum
level, $\vec{\ell^\text{min}}$ and $\vec{\ell^\text{max}}$, to select the
component grids $\vec{\ell} \in \mathcal{I}$.
The set of all selected component grids $\mathcal{I}$ is referred to as the
"combination scheme".
For regularity, we assume a constant difference between $\vec{\ell^\text{min}}$
and $\vec{\ell^\text{max}}$ in each dimension, but [other schemes can be useful](./advanced_topics).
<!-- The resulting component grids will be $d$ simplex "layers" in the space of
level vectors $\vec{\ell}$. -->

One can then compute suitable combination coefficients $c_{\vec{\ell}}^c$
that are used to obtain a sparse grid function
$f_\text{SG}$ by way of a linear superposition of
functions $f_{\vec{\ell}}$ defined on the respective component grids:

$f_\text{SG} = \sum_{\vec{\ell} \in \mathcal{I}} c_{\vec{\ell}}^c\cdot f_{\vec{\ell}}$

In the two-dimensional combination scheme in the above figure, all combination
coefficients are either $1$ or $-1$.
The sparse grid is illustrated on the upper right.
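
To illustrate where these coefficients come from, the following plain-C++
sketch (not the DisCoTec API) evaluates the classical formula for a regular,
non-truncated scheme of level $n$, in which a component grid on the $q$-th
finest diagonal $|\vec{\ell}|_1 = n + (d-1) - q$ receives the coefficient
$(-1)^q \binom{d-1}{q}$; the truncated schemes used by DisCoTec generalize this:

```cpp
#include <numeric>
#include <vector>

// Binomial coefficient C(n, k) via the standard iterative product.
long long binomial(int n, int k) {
  long long result = 1;
  for (int i = 1; i <= k; ++i) result = result * (n - k + i) / i;
  return result;
}

// Combination coefficient of the component grid with level vector l in the
// classical (regular, lmin = 1) combination scheme of level n.
double combinationCoefficient(const std::vector<int>& l, int n) {
  const int d = static_cast<int>(l.size());
  const int levelSum = std::accumulate(l.begin(), l.end(), 0);
  const int q = n + (d - 1) - levelSum;  // which diagonal the grid lies on
  if (q < 0 || q >= d) return 0.0;       // grid is not part of the scheme
  return (q % 2 == 0 ? 1.0 : -1.0) * static_cast<double>(binomial(d - 1, q));
}

// In 2D this yields +1 on the finest diagonal (|l|_1 = n + 1) and -1 on the
// next one (|l|_1 = n), matching the figure above.
```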

This sparse grid function can be expected to be accurate to
$\mathcal{O}(h^2 \cdot (\log h^{-1})^{d-1})$, given some assumptions on the bounds of
mixed-dimension partial derivatives, where $h = 1/N$ is the finest grid spacing
and $N$ the number of points corresponding to the finest resolution occurring
in the scheme.
At the same time, the number of points is only $\mathcal{O}(d\cdot N(\log N)^{d-1})$,
drastically less than the $\mathcal{O}(N^d)$ needed for a finely-resolved full
grid.
Thus, while the gains in accuracy per degree of freedom are most prominent for
high dimensionalities ($d \geq 4$), even two-dimensional schemes can benefit
significantly at fine resolutions.
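
As a rough back-of-the-envelope comparison (asymptotic counts only, ignoring
constant factors): for $d = 6$ and $N = 2^{10} = 1024$, a full grid would
require $N^d = 2^{60} \approx 10^{18}$ points, whereas
$d \cdot N (\log_2 N)^{d-1} = 6 \cdot 1024 \cdot 10^5 \approx 6 \cdot 10^8$,
roughly nine orders of magnitude fewer degrees of freedom.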

<!-- In the most commonly used type of combination scheme, related to "regular" sparse
grids, there are always $d$ layers -->

Between time steps, the grids exchange data through an intermediate multi-scale
representation, which is summarized as the "combination" step in DisCoTec.
Assuming a certain smoothness in the solution, this allows for a good
31 changes: 31 additions & 0 deletions docs/getting_started.md
@@ -117,6 +117,11 @@ source folder.

## Run the tests

Beware: most of the tests in DisCoTec need significant resources:
eight to nine MPI processes, file I/O, and, when built in Release mode,
enough compute for some performance tests.
Make sure you run them on appropriate hardware.

To run the compiled tests, go to folder `tests` and make sure the correct MPI runtime
is loaded to your path, e.g. with

@@ -154,6 +159,32 @@ The exact format and naming in `ctparam` is not (yet) standardized, to allow
adaptation for different PDE solver applications.
Please refer to existing parameter files and example implementations.

The different examples intend to illustrate various aspects of DisCoTec:

- [`combi_example`](https://github.com/SGpp/DisCoTec/blob/main/examples/combi_example/):
manager-worker setup with a generic `TaskExample`
that interpolates a cosine function on the component grids
- [`combi_example_faults`](https://github.com/SGpp/DisCoTec/blob/main/examples/combi_example_faults/):
like `combi_example`, but with fault tolerance functionality
(requires that DisCoTec is built with the `DISCOTEC_ENABLEFT` flag)
- [`distributed_advection`](https://github.com/SGpp/DisCoTec/blob/main/examples/distributed_advection/):
manager-worker setup using a first-order upwinding advection solver,
illustrates how to evaluate the Monte-Carlo norms
(and error norms, as an analytical solution is given for this PDE problem)
- [`combi_workers_only`](https://github.com/SGpp/DisCoTec/blob/main/examples/combi_workers_only/):
like `distributed_advection`, but with worker-only setup;
illustrates the static task assignment required for worker-only scenarios
- [`distributed_third_level`](https://github.com/SGpp/DisCoTec/blob/main/examples/distributed_third_level/):
like `distributed_advection` and `combi_workers_only`;
adds logic for widely-distributed simulations
- GENE examples, starting with `gene_`: show the interaction with
the GENE gyrokinetic solver (using a complicated build/link setup
that we no longer recommend; see the next example instead)
- [`selalib_distributed`](https://github.com/SGpp/DisCoTec/blob/main/examples/selalib_distributed/):
shows the interaction with a Semi-Lagrangian Vlasov solver in SeLaLib:
a static library is built as part of SeLaLib, and linked through DisCoTec's
`Task` interface in `SelalibTask`.

(pinning-with-various-mpi-implementations)=
### Pinning With Various MPI Implementations

22 changes: 17 additions & 5 deletions docs/parallelism.md
@@ -16,9 +16,17 @@ Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/h

All component grids are distributed to process groups (either statically, or
dynamically through the manager rank).
Each process group will be responsible for a set of component grids and the
solver instance belonging to each.
Within a process group and each grid, one can observe the exact same domain
decomposition parallelism as when using the solver directly.
(The domain decomposition needs to be the same for all solver instances.)
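
For instance, with a per-grid parallelization of $p = (2, 2, 2, 1, 1, 1)$, each
process group consists of $2 \cdot 2 \cdot 2 = 8$ worker ranks; running four
such groups then occupies $4 \cdot 8 = 32$ worker ranks (plus one additional
rank if a dedicated manager is used).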

In each solver step, the PDE solver applies one or multiple time step updates
to the values in each component grid.
During the PDE solver time step and most of the combination step, MPI communication
only happens within the process groups.
Conversely, for the sparse grid reduction using the combination coefficients,
Conversely, for the sparse grid reduction using the combination coefficients $c_{\vec{\ell}}^c$,
MPI communication only happens between a rank and its colleagues in the other
process groups, e.g., rank 0 in group 0 will only talk to rank 0 in all other groups.
Thus, major bottlenecks arising from global communication can be avoided altogether.
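
As a concrete example, with four process groups of eight ranks each, the sparse
grid reduction decomposes into eight independent reductions, each involving
only the four ranks that hold the same subdomain in their respective groups.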
@@ -39,21 +47,25 @@ If needed, the exact distribution of coordinates to ranks is set in the
Combining the two ways of scaling up, DisCoTec's scalability was demonstrated on
several machines, with the experiments comprising up to 524288 cores:

![timings for advection solver step on HAWK at various
![timings for 6-d advection solver step on HAWK at various
parallelizations](../gfx/times-solver-on-hawk.svg)
![timings for combination step on
HAWK at various parallelizations](../gfx/times-combination-on-hawk.svg)

We see the timings (in seconds) for the advection solver step and the
combination step, respectively.
We see the timings (in seconds) for the solver and combination steps respectively,
for an illustrative (6-D) PDE problem.
This weak scaling experiment uses four OpenMP threads per rank and starts with
one process group (pg) of four processes in the upper left corner.
The largest parallelization is 64 pgs of 2048 processes each.
Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/handle/11682/14229)).

These plots also illustrate the challenges in large-scale experiments with DisCoTec:
If the process groups become too large, the MPI communication of the multiscale
Generally, the process group size can be chosen so that it matches the solver
and the desired resolutions well.
But if the process groups become too large, the MPI communication of the multiscale
transform starts to dominate the combination time.
Conversely, the number of process groups can be freely chosen, as long as there
is sufficient work for each group to do (one grid per group at a bare minimum).
If there are too many pgs, the combination reduction will dominate the
combination time.
However, the times required for the PDE solver stay relatively constant;
50 changes: 35 additions & 15 deletions docs/simple_tutorial.md
@@ -15,6 +15,8 @@ int num_points_x = ...
int num_points_y = ...
int num_points_z = ...
int num_points_vx = ...
int num_points_vy = ...
int num_points_vz = ...
...
initial_function = ...
grid.initialize(
@@ -23,6 +25,8 @@ grid.initialize(
num_points_y,
num_points_z,
num_points_vx,
num_points_vy,
num_points_vz,
...
)
helper_data_structures.initialize(...)
@@ -39,6 +43,8 @@ properties = compute_properties(grid)
write_output(properties, grid)
```

This example assumes that your PDE solver uses three dimensions in both space and velocity.

For use with DisCoTec, we assume nestable power-of-two discretizations, i.e.,
where your grid spacing can be $2^{-l}$ for $l \in \mathbb{N}$.
For simplicity, in this tutorial we also assume you use periodic boundary
@@ -60,6 +66,8 @@ int num_points_x = ...
int num_points_y = ...
int num_points_z = ...
int num_points_vx = ...
int num_points_vy = ...
int num_points_vz = ...
...

- initial_function = ...
@@ -69,6 +77,8 @@ int num_points_vx = ...
- num_points_y,
- num_points_z,
- num_points_vx,
- num_points_vy,
- num_points_vz,
- ...
- )
- helper_data_structures.initialize(...)
@@ -77,6 +87,8 @@ int num_points_vx = ...
+ num_points_y,
+ num_points_z,
+ num_points_vx,
+ num_points_vy,
+ num_points_vz,
+ ...
+ )
float end_time = ...
@@ -112,7 +124,9 @@ C++ header `YourSolver.h`, roughly like this:

class YourSolver {
public:
YourSolver(int num_points_x, int num_points_y, int num_points_vx, int num_points_vy, ...);
YourSolver(int num_points_x, int num_points_y, int num_points_z,
int num_points_vx, int num_points_vy, int num_points_vz,
...);
//...rule of 5...

void run(double time_step);
@@ -143,9 +157,12 @@ With `YourSolver`, that will be as simple as:

class YourTask : public combigrid::Task {
public:
YourTask(combigrid::LevelVector& l, const std::vector<combigrid::BoundaryType>& boundary,
combigrid::real coeff, combigrid::real dt)
: Task(4, l, boundary, coeff, nullptr, nullptr), sim_(nullptr), dfg_(nullptr), dt_(dt) {}
YourTask(combigrid::LevelVector& l,
const std::vector<combigrid::BoundaryType>& boundary,
combigrid::real coeff,
combigrid::real dt)
    : Task(6, l, boundary, coeff, nullptr, nullptr),
sim_(nullptr), dfg_(nullptr), dt_(dt) {}

virtual ~YourTask(){
if (sim_ != nullptr){
@@ -162,17 +179,20 @@
const auto& l = this->getLevelVector();
int num_points_x = combigrid::powerOfTwoByBitshift(l[0]);
int num_points_y = combigrid::powerOfTwoByBitshift(l[1]);
int num_points_vx = combigrid::powerOfTwoByBitshift(l[2]);
int num_points_vy = combigrid::powerOfTwoByBitshift(l[3]);
int num_points_z = combigrid::powerOfTwoByBitshift(l[2]);
int num_points_vx = combigrid::powerOfTwoByBitshift(l[3]);
int num_points_vy = combigrid::powerOfTwoByBitshift(l[4]);
int num_points_vz = combigrid::powerOfTwoByBitshift(l[5]);

// this is the number of MPI processes in each dimension --
// if all are 1, we are only using the parallelism between grids
std::vector<int> p = {1, 1, 1, 1};
std::vector<int> p = {1, 1, 1, 1, 1, 1};

// if using MPI within your solver,
// pass p and the lcomm communicator to sim_, too
sim_ =
std::make_unique<YourSolver>(num_points_x, num_points_y, num_points_vx, num_points_vy);
std::make_unique<YourSolver>(num_points_x, num_points_y, num_points_z,
num_points_vx, num_points_vy, num_points_vz);
// wrap tensor in a DistributedFullGrid
dfg_ = std::make_unique<combigrid::DistributedFullGrid<combigrid::CombiDataType>>(
this->getDim(), this->getLevelVector(), lcomm, this->getBoundary(),
@@ -201,7 +221,8 @@ contiguous array.
The size of the whole "global" grid is $2^{l_d}$ points in each dimension $d$,
where $l_d$ is the corresponding entry of the level vector.
Every MPI process in your solver should then hold $2^{l_d} / p_d$ grid
points, where $p$ is the Cartesian process vector.
points, where $p$ is the Cartesian process vector
containing the number of processes per dimension in each [process group](./parallelism.md).
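
For example, with $l_d = 6$ and $p_d = 2$ in some dimension, the global grid
has $2^6 = 64$ points in that dimension, of which each rank holds $64 / 2 = 32$.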

## Make Your MPI Processes Workers

@@ -239,15 +260,14 @@ int main(int argc, char** argv) {
combigrid::theMPISystem()->initWorldReusable(MPI_COMM_WORLD, ngroup, nprocs, false, true);

// input parameters
const combigrid::DimType dim = 4; // four-dimensional problem
const combigrid::DimType dim = 6; // six-dimensional problem
const combigrid::LevelVector lmin = {
2, 2, 2,
2}; // minimal level vector for each grid -> have at least 4 points in each dimension
const combigrid::LevelVector lmax = {6, 6, 6, 6}; // maximum level vector -> level vector of target grid
2, 2, 2}; // minimal level vector for each grid -> have at least 4 points in each dimension
const combigrid::LevelVector lmax = {6, 6, 6, 6, 6, 6}; // maximum level vector -> level vector of target grid
const std::vector<int> p = {
1, 1, 1, 1}; // parallelization of domain (one process per dimension) -> must match nprocs
const std::vector<bool> hierarchizationDims = {true, true, true,
true}; // all dimensions should be hierarchized
1, 1, 1, 1, 1, 1}; // parallelization of domain (one process per dimension) -> must match nprocs
const std::vector<bool> hierarchizationDims(dim, true); // all dimensions should be hierarchized
const std::vector<combigrid::BoundaryType> boundary(dim,
1); // periodic boundary in every dimension
const combigrid::real dt = 0.01; // time step