diff --git a/docs/advanced_topics.md b/docs/advanced_topics.md
index cf088421..f31502de 100644
--- a/docs/advanced_topics.md
+++ b/docs/advanced_topics.md
@@ -9,6 +9,7 @@
- J. Garcke. ‘Sparse Grids in a Nutshell’. In: Sparse Grids and Applications.
  Ed. by J. Garcke, M. Griebel. Lecture Notes in Computational Science and
  Engineering. Berlin, Heidelberg: Springer, 2013, pp. 57–80.
+ url:
- D. Pflüger. Spatially Adaptive Sparse Grids for High-Dimensional Problems.
  Verlag Dr. Hut, 2010.
- W. Guo, Y. Cheng. ‘A Sparse Grid Discontinuous Galerkin Method for High-Dimensional
@@ -29,23 +30,19 @@
  of sparse grid problems’. In: Proceedings of the IMACS International Symposium
  on Iterative Methods in Linear Algebra: Brussels, Belgium, 2 - 4 April, 1991.
  Ed. by P. de Groen, R. Beauwens. North Holland, 1992.
+ url:
- M. Heene. ‘A Massively Parallel Combination Technique for the Solution of
  High-Dimensional PDEs’. PhD thesis. Institut für Parallele und Verteilte
  Systeme der Universität Stuttgart, 2018.
+ url:

## Variants of the Combination Technique in DisCoTec

-### Truncated Combination Technique
+### Adaptivity

DisCoTec by default uses the *truncated combination technique*, which sets a minimum
level for all component grids used.

-- J. Benk, D. Pflüger. ‘Hybrid parallel solutions of the Black-Scholes PDE with
-  the truncated combination technique’. In: 2012 International Conference on
-  High Performance Computing Simulation (HPCS). July 2012.
-
-### Adaptivity
-
Of the two ways of bringing adaptivity into the combination technique, DisCoTec
currently supports only one:
Static dimensional adaptivity can be achieved by providing the combination
@@ -58,6 +55,11 @@
If your application needs dynamic (= during runtime) or spatial adaptivity

- T. Gerstner, M. Griebel. ‘Dimension–adaptive tensor–product quadrature’. In:
  Computing 71.1 (2003), pp. 65–87.
+ url:
+- J. Benk, D. Pflüger. ‘Hybrid parallel solutions of the Black-Scholes PDE with
+  the truncated combination technique’. In: 2012 International Conference on
+  High Performance Computing Simulation (HPCS). July 2012.
+ url:
- C. Kowitz. ‘Applying the Sparse Grid Combination Technique in Linear Gyrokinetics’.
  Dissertation. München: Technische Universität München, 2016.
- M. Obersteiner. ‘A spatially adaptive and massively parallel implementation of
@@ -149,9 +151,10 @@
The last reference below shows how conservation of mass and L2 stability are only
provided by the latter two.
In practice, we have observed that using hierarchical hat functions and long
combination intervals (many time steps per combination) is fine with relatively
-laminar PDE solutions.
-But in the turbulent regime, it becomes necessary to use the CDF wavelets and to
-combine after every PDE solver time step to avoid numerical instability.
+laminar PDE solutions
+(both in fluid dynamics and plasma turbulence simulations).
+However, in the turbulent regime, it becomes necessary to use the CDF wavelets
+and to combine after every PDE solver time step to avoid numerical instability.
If you find yourself in need of higher orders of accuracy or conservation, you
could add higher-order CDF wavelets to `DistributedHierarchization.hpp`.
@@ -245,7 +248,7 @@
with LZ4.

### Using PDE Solvers Written In Other Programming Languages

-Your functions need the same described interface and need to somehow expose it
+Your functions need the same interface as described above and need to expose it
to the C++ compiler.
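For example, a solver written in another compiled language could expose a small set of
C-linkage functions that the C++ side declares and calls. The following is only a sketch
with hypothetical function names; it is not an actual DisCoTec or SeLaLib interface:

```cpp
// Hedged sketch: hypothetical C-linkage declarations for a solver written in
// another compiled language (e.g. Fortran with bind(C)). Only the pattern matters.
extern "C" {
// create a solver instance for a given resolution and return an opaque handle
void* my_solver_create(int num_points_x, int num_points_y, int num_points_z);
// advance the solution held by the handle by one time step of size dt
void my_solver_run(void* handle, double dt);
// return a pointer to the solver's local, contiguous grid data
double* my_solver_get_data(void* handle);
// free all resources associated with the handle
void my_solver_destroy(void* handle);
}
```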
For Fortran, the SelalibTask shows how the
[relevant functions](https://github.com/selalib/selalib/blob/main/simulations/parallel/bsl_vp_3d3v_cart_dd/sll_m_sim_bsl_vp_3d3v_cart_dd_slim_interface.F90)
can be [made accessible with `extern "C"`](https://github.com/SGpp/DisCoTec/blob/main/examples/selalib_distributed/src/SelalibTask.hpp).

diff --git a/docs/combination_technique.md b/docs/combination_technique.md
index 0c5ede88..c11daf61 100644
--- a/docs/combination_technique.md
+++ b/docs/combination_technique.md
@@ -16,13 +16,58 @@
Taken together, all component grids form a sparse grid approximation, which can be
explicitly obtained by a linear superposition of the individual grid functions,
with the so-called combination coefficients.
+The component grids can be identified with their level vector
+$\vec{\ell} \in \mathbb{N}^d$, whose length equals the dimensionality $d$ of the problem.
+From it, one can derive the grid's spacing $\vec{h}$ as $h_i = 2^{-\ell_i}$ in each
+dimension $i$, assuming the domain is $[0,1)^d$.
+
 ![schematic of a combination scheme in 2D](../gfx/combischeme-2d.svg)

-In this two-dimensional combination scheme, all combination coefficients are 1
-and -1, respectively. Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/handle/11682/14229), adapted from Pollinger et al. [2023](https://dl.acm.org/doi/10.1145/3581784.3607036)).
+For instance, an $x$-direction level of 5 denotes a grid spacing in the
+$x$ dimension of $h_x = \frac{1}{32}$. Assuming periodicity, this means we use 32
+points in dimension $x$, as illustrated by the rightmost component grid in the
+above figure.
+
+Here, we use the truncated combination technique, which sets a minimum and maximum
+level, $\vec{\ell^\text{min}}$ and $\vec{\ell^\text{max}}$, to select the
+component grids $\vec{\ell} \in \mathcal{I}$.
+The set of all selected component grids $\mathcal{I}$ is referred to as the
+"combination scheme".
+For regularity, we assume a constant difference between $\vec{\ell^\text{min}}$
+and $\vec{\ell^\text{max}}$ in each dimension, but [other schemes can be useful](./advanced_topics).
+
+One can then compute suitable combination coefficients $c_{\vec{\ell}}^c$
+that are used to obtain a sparse grid function
+$f_\text{SG}$ by way of a linear superposition of
+functions $f_{\vec{\ell}}$ defined on the component grids:
+
+$f_\text{SG} = \sum_{\vec{\ell} \in \mathcal{I}} c_{\vec{\ell}}^c \cdot f_{\vec{\ell}}$
+
+In the two-dimensional combination scheme in the above figure, all combination
+coefficients are 1 and -1, respectively.
+The sparse grid is illustrated on the upper right.
+
+This sparse grid function can be expected to have an error of order
+$\mathcal{O}(h^2 \cdot (\log h^{-1})^{d-1})$, given some assumptions on the bounds of
+mixed-dimension partial derivatives, where $h = 1/N$ is the grid spacing and $N$ the
+number of points corresponding to the finest resolution occurring in the scheme.
+At the same time, the number of points is only $\mathcal{O}(d\cdot N(\log N)^{d-1})$,
+drastically less than the $\mathcal{O}(N^d)$ needed for a finely-resolved full
+grid.
+Thus, while the gains in accuracy per degree of freedom are most prominent for
+high dimensionalities ($d \geq 4$), even two-dimensional schemes can benefit
+significantly for fine resolutions.
+
Between time steps, the grids exchange data through an intermediate multi-scale
representation, which is summarized as the "combination" step in DisCoTec.
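For concreteness, this is what a combination computes in the simplest setting: for the
classical, non-truncated two-dimensional scheme of level $n$, the combined solution can
be written as below. This is a standard textbook form, shown only as a sketch for
orientation; DisCoTec's truncated scheme generalizes it via $\vec{\ell^\text{min}}$, and
index conventions may differ slightly:

```latex
% Classical two-dimensional combination technique of level n
% (component grid levels \ell_1, \ell_2 >= 1):
% grids on the diagonal \ell_1 + \ell_2 = n + 1 enter with coefficient +1,
% grids on the diagonal \ell_1 + \ell_2 = n     enter with coefficient -1.
f_\text{SG} = \sum_{\ell_1 + \ell_2 = n + 1} f_{(\ell_1, \ell_2)}
            - \sum_{\ell_1 + \ell_2 = n} f_{(\ell_1, \ell_2)}
```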
Assuming a certain smoothness in the solution, this allows for a good

diff --git a/docs/getting_started.md b/docs/getting_started.md
index b8ce729d..c2618b56 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -117,6 +117,11 @@
source folder.

## Run the tests

+Beware: Most of the tests in DisCoTec need significant resources
+(eight to nine MPI processes and file I/O),
+and some of them are performance tests when built in Release mode.
+Make sure you run them on appropriate hardware.
+
To run the compiled tests, go to folder `tests`
and make sure the correct MPI runtime is loaded to your path, e.g. with
@@ -154,6 +159,32 @@
The exact format and naming in `ctparam` is not (yet) standardized, to allow
adaptation for different PDE solver applications.
Please refer to existing parameter files and example implementations.

+The different examples are intended to illustrate various aspects of DisCoTec:
+
+- [`combi_example`](https://github.com/SGpp/DisCoTec/blob/main/examples/combi_example/):
+  manager-worker setup with a generic `TaskExample`
+  that interpolates a cosine function on the component grids
+- [`combi_example_faults`](https://github.com/SGpp/DisCoTec/blob/main/examples/combi_example_faults/):
+  like `combi_example`, but with fault tolerance functionality
+  (requires that DisCoTec is built with the `DISCOTEC_ENABLEFT` flag)
+- [`distributed_advection`](https://github.com/SGpp/DisCoTec/blob/main/examples/distributed_advection/):
+  manager-worker setup using a first-order upwinding advection solver;
+  illustrates how to evaluate the Monte-Carlo norms
+  (and error norms, as an analytical solution is given for this PDE problem)
+- [`combi_workers_only`](https://github.com/SGpp/DisCoTec/blob/main/examples/combi_workers_only/):
+  like `distributed_advection`, but with worker-only setup;
+  illustrates the static task assignment required for worker-only scenarios
+- [`distributed_third_level`](https://github.com/SGpp/DisCoTec/blob/main/examples/distributed_third_level/):
+  like `distributed_advection` and `combi_workers_only`;
+  adds logic for widely-distributed simulations
+- GENE examples, starting with `gene_`: show the interaction with
+  the GENE gyrokinetic solver (using a complicated build/link setup
+  that we no longer recommend using; see the next example instead)
+- [`selalib_distributed`](https://github.com/SGpp/DisCoTec/blob/main/examples/selalib_distributed/):
+  shows the interaction with a Semi-Lagrangian Vlasov solver in SeLaLib:
+  a static library is built as part of SeLaLib, and linked through DisCoTec's
+  `Task` interface in `SelalibTask`.
+
(pinning-with-various-mpi-implementations)=
### Pinning With Various MPI Implementations

diff --git a/docs/parallelism.md b/docs/parallelism.md
index f364533f..d23120e7 100644
--- a/docs/parallelism.md
+++ b/docs/parallelism.md
@@ -16,9 +16,17 @@ Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/h
All component grids are distributed to process groups (either statically, or
dynamically through the manager rank).
+Each process group will be responsible for a set of component grids and the
+solver instance belonging to each.
+Within a process group, each grid uses the exact same domain-decomposition
+parallelism as when using the solver directly.
+(The domain decomposition needs to be the same for all solver instances.)
+
+During a time step, the PDE solver step applies one or multiple time step updates
+to the values in each component grid.
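As a small numerical illustration of this two-level layout, consider the following
stand-alone sketch. It uses plain C++, made-up numbers, and a simple round-robin
assignment rather than DisCoTec's actual scheduling logic:

```cpp
#include <cstdio>
#include <vector>

// Hedged sketch (not DisCoTec code): two process groups of four ranks each give
// eight worker ranks in total; a static round-robin assignment distributes the
// component grids of the combination scheme over the groups.
int main() {
  const int ngroup = 2;             // number of process groups
  const int nprocs = 4;             // MPI ranks per process group
  const int numComponentGrids = 6;  // size of the combination scheme
  std::printf("worker ranks in total: %d\n", ngroup * nprocs);
  std::vector<std::vector<int>> gridsOfGroup(ngroup);
  for (int grid = 0; grid < numComponentGrids; ++grid) {
    gridsOfGroup[grid % ngroup].push_back(grid);  // static assignment
  }
  for (int g = 0; g < ngroup; ++g) {
    std::printf("group %d handles %zu component grids\n", g, gridsOfGroup[g].size());
  }
  return 0;
}
```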
During the PDE solver time step and most of the combination step, MPI communication
only happens within the process groups.
-Conversely, for the sparse grid reduction using the combination coefficients,
+Conversely, for the sparse grid reduction using the combination coefficients $c_{\vec{\ell}}^c$,
MPI communication only happens between a rank and its colleagues in the other
process groups, e.g., rank 0 in group 0 will only talk to rank 0 in all other groups.
Thus, major bottlenecks arising from global communication can be avoided altogether.
@@ -39,21 +47,25 @@ If needed, the exact distribution of coordinates to ranks is set in the

Combining the two ways of scaling up, DisCoTec's scalability was demonstrated on
several machines, with the experiments comprising up to 524288 cores:

-![timings for advection solver step on HAWK at various
+![timings for 6-D advection solver step on HAWK at various
parallelizations](../gfx/times-solver-on-hawk.svg)
![timings for combination step on HAWK at various
parallelizations](../gfx/times-combination-on-hawk.svg)

-We see the timings (in seconds) for the advection solver step and the
-combination step, respectively.
+We see the timings (in seconds) for the solver and combination steps, respectively,
+for an illustrative (6-D) PDE problem.
This weak scaling experiment used four OpenMP threads per rank, and started with
one pg of four processes in the upper left corner.
The largest parallelization is 64 pgs of 2048 processes each.
Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/handle/11682/14229)).

This image also describes the challenges in large-scale experiments with DisCoTec:
-If the process groups become too large, the MPI communication of the multiscale
+Generally, the process group size can be chosen so that it matches the solver
+and the desired resolutions well.
+But if the process groups become too large, the MPI communication of the multiscale
transform starts to dominate the combination time.
+Conversely, the number of process groups can be freely chosen, as long as there
+is sufficient work for each group to do (one grid per group at a bare minimum).
If there are too many pgs, the combination reduction will dominate the combination time.
However, the times required for the PDE solver stay relatively constant;

diff --git a/docs/simple_tutorial.md b/docs/simple_tutorial.md
index 7a5eb773..5fd82ebd 100644
--- a/docs/simple_tutorial.md
+++ b/docs/simple_tutorial.md
@@ -15,6 +15,8 @@ int num_points_x = ...
int num_points_y = ...
int num_points_z = ...
int num_points_vx = ...
+int num_points_vy = ...
+int num_points_vz = ...
...
initial_function = ...
grid.initialize(
@@ -23,6 +25,8 @@ grid.initialize(
  num_points_y,
  num_points_z,
  num_points_vx,
+  num_points_vy,
+  num_points_vz,
  ...
)
helper_data_structures.initialize(...)
@@ -39,6 +43,8 @@ properties = compute_properties(grid)
write_output(properties, grid)
```
+This example assumes that your PDE solver uses three dimensions in both space and
+velocity.
+
For use with DisCoTec, we assume nestable power-of-two discretizations, i.e., where
your grid spacing can be $2^{-l}$ for $l \in \mathbb{N}$.
For simplicity, in this tutorial we also assume you use periodic boundary
@@ -60,6 +66,8 @@ int num_points_x = ...
int num_points_y = ...
int num_points_z = ...
int num_points_vx = ...
+int num_points_vy = ...
+int num_points_vz = ...
...
- initial_function = ...
@@ -69,6 +77,8 @@ int num_points_vx = ...
- num_points_y,
- num_points_z,
- num_points_vx,
+- num_points_vy,
+- num_points_vz,
- ...
- )
- helper_data_structures.initialize(...)
@@ -77,6 +87,8 @@ int num_points_vx = ...
+ num_points_y,
+ num_points_z,
+ num_points_vx,
++ num_points_vy,
++ num_points_vz,
+ ...
+ )
float end_time = ...
@@ -112,7 +124,9 @@ C++ header `YourSolver.h`, roughly like this:
class YourSolver {
 public:
-  YourSolver(int num_points_x, int num_points_y, int num_points_vx, int num_points_vy, ...);
+  YourSolver(int num_points_x, int num_points_y, int num_points_z,
+             int num_points_vx, int num_points_vy, int num_points_vz,
+             ...);

  //...rule of 5...

  void run(double time_step);
@@ -143,9 +157,12 @@ With `YourSolver`, that will be as simple as:
class YourTask : public combigrid::Task {
 public:
-  YourTask(combigrid::LevelVector& l, const std::vector<combigrid::BoundaryType>& boundary,
-           combigrid::real coeff, combigrid::real dt)
-      : Task(4, l, boundary, coeff, nullptr, nullptr), sim_(nullptr), dfg_(nullptr), dt_(dt) {}
+  YourTask(combigrid::LevelVector& l,
+           const std::vector<combigrid::BoundaryType>& boundary,
+           combigrid::real coeff,
+           combigrid::real dt)
+      : Task(6, l, boundary, coeff, nullptr, nullptr),
+        sim_(nullptr), dfg_(nullptr), dt_(dt) {}

  virtual ~YourTask(){
    if (sim_ != nullptr){
@@ -162,17 +179,20 @@ class YourTask : public combigrid::Task {
    const auto& l = this->getLevelVector();
    int num_points_x = combigrid::powerOfTwoByBitshift(l[0]);
    int num_points_y = combigrid::powerOfTwoByBitshift(l[1]);
-    int num_points_vx = combigrid::powerOfTwoByBitshift(l[2]);
-    int num_points_vy = combigrid::powerOfTwoByBitshift(l[3]);
+    int num_points_z = combigrid::powerOfTwoByBitshift(l[2]);
+    int num_points_vx = combigrid::powerOfTwoByBitshift(l[3]);
+    int num_points_vy = combigrid::powerOfTwoByBitshift(l[4]);
+    int num_points_vz = combigrid::powerOfTwoByBitshift(l[5]);
    // this is the number of MPI processes in each dimension --
    // if all are 1, we are only using the parallelism between grids
-    std::vector<int> p = {1, 1, 1, 1};
+    std::vector<int> p = {1, 1, 1, 1, 1, 1};
    // if using MPI within your solver,
    // pass p and the lcomm communicator to sim_, too
    sim_ =
-        std::make_unique<YourSolver>(num_points_x, num_points_y, num_points_vx, num_points_vy);
+        std::make_unique<YourSolver>(num_points_x, num_points_y, num_points_z,
+                                     num_points_vx, num_points_vy, num_points_vz);
    // wrap tensor in a DistributedFullGrid
    dfg_ = std::make_unique<combigrid::DistributedFullGrid<combigrid::CombiDataType>>(
        this->getDim(), this->getLevelVector(), lcomm, this->getBoundary(),
@@ -201,7 +221,8 @@ contiguous array.
The size of the whole "global" grid is $2^{l_d}$ in each dimension $d$, with $l_d$
the corresponding entry of the level vector.
Every MPI process in your solver should then have $2^{l_d} / p_d$ grid
-points, where $p$ is the Cartesian process vector.
+points, where $p$ is the Cartesian process vector
+containing the number of processes per dimension in each [process group](./parallelism.md).
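To make these sizes concrete, the following stand-alone sketch (plain C++, no DisCoTec
headers; the level vector and process vector are made-up values) computes the global
and per-rank point counts following the $2^{l_d}$ and $2^{l_d} / p_d$ rule described
above:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hedged sketch: count the points of the "global" grid and of each rank for a
// given level vector l and Cartesian process vector p (periodic boundaries,
// so 2^l_d points per dimension; assumes p_d divides 2^l_d).
int main() {
  const std::vector<int> l = {6, 6, 6, 6, 6, 6};  // level vector, one entry per dimension
  const std::vector<int> p = {2, 2, 1, 1, 1, 1};  // processes per dimension (8 ranks total)
  long long globalPoints = 1, localPoints = 1;
  for (std::size_t d = 0; d < l.size(); ++d) {
    const long long pointsInDim = 1LL << l[d];  // 2^l_d points in dimension d
    globalPoints *= pointsInDim;
    localPoints *= pointsInDim / p[d];          // points per rank in dimension d
  }
  std::printf("global points: %lld, points per rank: %lld\n", globalPoints, localPoints);
  return 0;
}
```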
## Make Your MPI Processes Workers

@@ -239,15 +260,14 @@ int main(int argc, char** argv) {
  combigrid::theMPISystem()->initWorldReusable(MPI_COMM_WORLD, ngroup, nprocs, false, true);

  // input parameters
-  const combigrid::DimType dim = 4;  // four-dimensional problem
+  const combigrid::DimType dim = 6;  // six-dimensional problem
  const combigrid::LevelVector lmin = {
      2, 2, 2,
-      2};  // minimal level vector for each grid -> have at least 4 points in each dimension
-  const combigrid::LevelVector lmax = {6, 6, 6, 6};  // maximum level vector -> level vector of target grid
+      2, 2, 2};  // minimal level vector for each grid -> have at least 4 points in each dimension
+  const combigrid::LevelVector lmax = {6, 6, 6, 6, 6, 6};  // maximum level vector -> level vector of target grid
  const std::vector<int> p = {
-      1, 1, 1, 1};  // parallelization of domain (one process per dimension) -> must match nprocs
-  const std::vector<bool> hierarchizationDims = {true, true, true,
-                                                 true};  // all dimensions should be hierarchized
+      1, 1, 1, 1, 1, 1};  // parallelization of domain (one process per dimension) -> must match nprocs
+  const std::vector<bool> hierarchizationDims(dim, true);  // all dimensions should be hierarchized
  const std::vector<combigrid::BoundaryType> boundary(dim, 1);  // periodic boundary in every dimension
  const combigrid::real dt = 0.01;  // time step