WIP: Changing data held in element to accessor data #129

bremerm31 · 2018-11-27T18:06:41Z

To optimize cache usage and partially enable vectorization, this PR converts the data from an array of structs layout (AoS) to a struct of arrays layout (SoA). To maintain the ease of use associated with the AoS layout, we are introducing an Accessor class, which will serve as a reference to the data associated with a given element.
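As a rough illustration of the layout change described above, the following is a minimal sketch only. `StateSoA`, its `height` and `momentum` fields, and `scale_height` are hypothetical names chosen for this example and are not taken from the PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical SoA container: each field lives in its own contiguous array.
struct StateSoA {
    // Lightweight reference to the fields belonging to one element.
    struct Accessor {
        double& height;
        double& momentum;
    };

    std::vector<double> height;
    std::vector<double> momentum;

    explicit StateSoA(std::size_t n) : height(n, 0.0), momentum(n, 0.0) {}

    std::size_t size() const { return height.size(); }

    // Element-wise view, so call sites can keep an AoS-like syntax.
    Accessor at(unsigned int i) { return Accessor{height[i], momentum[i]}; }
};

// Kernel written directly against one contiguous array; a unit-stride loop
// like this is a candidate for compiler auto-vectorization.
inline void scale_height(StateSoA& s, double factor) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        s.height[i] *= factor;
    }
}
```

Per-element code can keep writing `s.at(i).height`, while bulk kernels loop over `s.height` directly.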

Adding type trait to detect whether a class is a struct of arrays container. Such a class must have:

1. A typedef named `Accessor`
2. A member function `at()`, which accepts an `unsigned int` and returns an `Accessor`
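The two requirements above could be checked with a detection idiom along the following lines. This is a sketch under assumptions, not the trait from the PR: the `sketch` namespace, `DemoSoA`, and `NoAccessor` are hypothetical, and the hand-rolled `void_t` assumes a C++14 toolchain.

```cpp
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {

// C++14-compatible stand-in for std::void_t.
template <typename... Ts>
struct make_void { using type = void; };
template <typename... Ts>
using void_t = typename make_void<Ts...>::type;

// Primary template: T is not treated as an SoA container.
template <typename T, typename = void>
struct is_SoA : std::false_type {};

// Considered only if T::Accessor names a type and T::at(unsigned int) is
// callable; the result is true only if at() returns exactly T::Accessor.
template <typename T>
struct is_SoA<T, void_t<typename T::Accessor,
                        decltype(std::declval<T&>().at(std::declval<unsigned int>()))>>
    : std::is_same<typename T::Accessor,
                   decltype(std::declval<T&>().at(std::declval<unsigned int>()))> {};

// Hypothetical container that satisfies both requirements.
struct DemoSoA {
    struct Accessor { double& value; };
    std::vector<double> values;
    Accessor at(unsigned int i) { return Accessor{values[i]}; }
};

// Hypothetical container with at() but no nested Accessor type.
struct NoAccessor {
    std::vector<double> values;
    double& at(unsigned int i) { return values[i]; }
};

}  // namespace sketch
```

With this sketch, `sketch::is_SoA<sketch::DemoSoA>::value` is `true`, while `NoAccessor`, `int`, and `std::vector<double>` all yield `false`.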

Adding a function to extract members at each `index`. In the case of a vector of array of structs this will return Accessors for each array of structs.

- Adding missing include in `use_blaze.hpp`

We needed to add support for arrays of struct of arrays. Since accessors tend not to be default constructible, `index_sequence`s are used to build the arrays without default construction. Various allocator issues were fixed for the vectors of vectors and structs of arrays. Updated `test_at_each.cpp` accordingly.
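One common way to fill a `std::array` of non-default-constructible objects is to expand a `std::index_sequence` into a single braced initializer. The sketch below assumes C++14. `Accessor`, `Container`, and this `at_each` are simplified stand-ins and not the PR's actual `Utilities::at_each`.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical accessor: holds a reference, so it has no default constructor.
struct Accessor {
    double& value;
};

// Hypothetical SoA container handing out accessors.
struct Container {
    std::vector<double> values;
    Accessor at(unsigned int i) { return Accessor{values[i]}; }
};

namespace detail {
// Expanding the index pack builds every array entry inside one braced
// initializer, so no Accessor is ever default constructed.
template <std::size_t N, std::size_t... Is>
std::array<Accessor, N> at_each_impl(std::array<Container, N>& cs,
                                     unsigned int index,
                                     std::index_sequence<Is...>) {
    return {{cs[Is].at(index)...}};
}
}  // namespace detail

// Returns one accessor per container, all referring to the same `index`.
template <std::size_t N>
std::array<Accessor, N> at_each(std::array<Container, N>& cs, unsigned int index) {
    return detail::at_each_impl(cs, index, std::make_index_sequence<N>{});
}
```

Writing through a returned accessor modifies the underlying container, since each accessor only stores a reference.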

Additionally, since rows are not default constructible, I added a helper function for getting an array of rows from an array of matrices. `test_linear_algebra.cpp` updated to test functionality.

- Results match to full precision with master for manufactured solution
- Unit tests updated accordingly

Various discrepancies between Eigen and Blaze rows require workarounds here.

This is the first commit to move the entire hierarchy towards something where we have containers that contain both data in a struct of arrays layout and accessors.

For an efficient API, we need both the problem data and the element data struct of arrays to be accessible in one struct. This implies that we will set up a second class hierarchy mimicking the current element mesh hierarchy. From here, vectorized functions will be called directly on the SoAs.

- Adding missing function definitions for 1D legendre basis

By removing this flipping, we will aim to vectorize more of the interface kernel evaluations, and avoid awkward memory loads and stores.

This requires significant changes to the LLF flux API and, as a result, to the files that use it.

- `thread_local` was not playing nicely with the `double` instantiation of `LLF_flux`. Simply replaced with an overload
- Error in accessing surface normals in `test/test_boundary_interface.cpp`

`is_vectorized<T>` returns `std::true_type` if `T` has a constexpr boolean member `is_vectorized` that is set to true. Otherwise, `is_vectorized<T>` returns `std::false_type`.
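A trait with the behavior described above might look roughly like this. It is a sketch only: the `sketch` namespace and the three kernel types are hypothetical, and the PR's real implementation may differ.

```cpp
#include <type_traits>

namespace sketch {

// Primary template: T has no usable `is_vectorized` member, so report false.
template <typename T, typename = void>
struct is_vectorized : std::false_type {};

// Chosen when T::is_vectorized exists and has type `const bool` (which is
// what a `static constexpr bool` member has); inherits the member's value.
template <typename T>
struct is_vectorized<
    T,
    typename std::enable_if<
        std::is_same<decltype(T::is_vectorized), const bool>::value>::type>
    : std::integral_constant<bool, T::is_vectorized> {};

// Hypothetical kernels used to exercise the trait.
struct VectorKernel { static constexpr bool is_vectorized = true; };
struct ScalarKernel { static constexpr bool is_vectorized = false; };
struct PlainKernel {};

}  // namespace sketch
```

Because the result derives from `std::true_type` or `std::false_type`, it can drive tag dispatch between vectorized and element-wise code paths.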

- Uninitialized values were causing valgrind to come up not clean in `rkdg_swe_data_state.hpp`
- Error in Runge-Kutta update: solutions were being swapped in the state variable loop
- `thread_local` storage duration specifier was also causing issues. Removing for now. Will test performance on Skylake nodes

- L2 Errors match now

This reverts commit 308b398.

This reverts commit 43be065.

…ssors - Starting to update how integration is done.

Some weirdness in the use of block matrices, where the expression template requires explicit allocation or else it segfaults.

- Developed dg-micro-benchmarks, which have been used to find optimal data storage layouts
- This has sped up the volume kernel by a factor of 3 relative to the baseline

Adding container and vestiges of SoA class to mesh. Mesh types updated accordingly

- Each interface container will have a pointer to the `ElementContainers`. From there, we will be able to generate the sparse matrices for computing UgpIn and UgpEx as well as the integrations
- `ElementSoA` was updated so that `BoundaryData` is now included (various reserve functions required modification)

- `is_vectorized` in functor is now a `constexpr` function, which accepts a template argument. This allows us to dispatch function calls based on Element or interface type.

This is the beginning of the vectorization of the interface kernel. We have outlined the joint SoA/Accessor data type, and slightly modified the interface kernel. Difficulties are arising because the strong typing of elements requires the entire element SoA to be subsumed into the pointer class.

- First correct implementation of vectorized interface kernel
- System implements a similar dispatch system based on vectorized function objects
- Three outstanding fixmes:
  1. ComputeUgpBdry and integration routines need to only occur once. This may also allow for partial vectorization of non-vectorized interfaces (and potentially boundaries as well).
  2. FMAs need to be incorporated into numerical flux
  3. shrinkToFit() in interface kernel is causing memory leaks

- Adding missing interface data structure

GCC doesn't correctly implement `_mm512_abs_pd` as of gcc/8.1.0, so we are using `max(a,-a)` in place of the intrinsic to get around this.

- To allow for optimal memory use and to utilize vectorization, all data layouts relating to the interface kernel have been moved to a column-major layout
- This gives an additional speed-up of 1.5x for the interface kernel
- Total speed-up relative to last commit is 1.24x
- Total speed-up relative to baseline (point of forking) is 2.12x

Updated distributed boundaries with new SoA dispatch system. Numerically verified for OMPI runs

- using `DynMatrices` causes stack overflows for HPX

- this should replace a complexity of O(n^2) with O(n) (on average)

Done to potentially avoid overhead associated with multithreaded MPI

Initializing sparse matrices was being done very inefficiently. To remedy this, we changed the initialization of all `CompressedMatrices`. This included sections in `element_soa.hpp` and `interface_soa.hpp`. Based on the studies run, initialization appears to scale linearly with the number of elements.

| Number of elements | Time (in seconds) |
|--------------------|-------------------|
| 65k | 1.9 |
| 262k | 8.1 |
| 1049k | 32.0 |

Starting to refactor code to support SoA layout

- Note that there remain errors with solving linear systems.
- Will need to cherry pick bc9d45a

When solving linear systems using Blaze, the matrix must be in column-major order; otherwise, Blaze will solve `A^T X = B`.

- Add overload to solve systems with row major storage
- Explicitly make `delta_hat_global` column major for performance
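One way to see why storage order matters here: a buffer filled row by row and then read column by column yields the transposed matrix. The sketch below uses plain C++ with no Blaze calls. `N`, `row_major_at`, `col_major_at`, and `reads_as_transpose` are hypothetical helpers for illustration only.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 2;
using Buffer = std::array<double, N * N>;

// Entry (i, j) when the buffer is interpreted as stored row by row.
inline double row_major_at(const Buffer& buf, std::size_t i, std::size_t j) {
    return buf[i * N + j];
}

// Entry (i, j) when the same buffer is interpreted as stored column by column.
inline double col_major_at(const Buffer& buf, std::size_t i, std::size_t j) {
    return buf[j * N + i];
}

// True if the column-major reading of `buf` equals the transpose of its
// row-major reading, i.e. B(i, j) == A(j, i) for every entry.
inline bool reads_as_transpose(const Buffer& buf) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            if (col_major_at(buf, i, j) != row_major_at(buf, j, i)) {
                return false;
            }
        }
    }
    return true;
}
```

A routine that assumes column-major input but receives row-major data therefore effectively operates on `A^T`, which is consistent with the behavior noted above.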

bremerm31 · 2019-05-14T15:52:47Z

With 6b861b0, EHDG-SWE works in serial.

Both OMPI and HPX parallelizations match the serial implementation to full numerical precision.

Code compiles; however, we have not been able to verify correctness. As a note to self, potential areas of concern remain:

- use of gp_ex

bremerm31 and others added 30 commits November 27, 2018 12:03

Changing data held in element to accessor data

81b9f08

Adding is_SoA type trait

fa240ea

Adding type trait to detect whether a class is a struct of arrays container. Such a class must have:

1. A typedef named `Accessor`
2. A member function `at()`, which accepts an `unsigned int` and returns an `Accessor`

Adding Utilities::at_each for vectors of containers

a89095e

Adding a function to extract members at each `index`. In the case of a vector of array of structs this will return Accessors for each array of structs.

- Adding missing include in `use_blaze.hpp`

Added DynRow to linear_algebra.hpp

f6761de

Additionally, since rows are not default constructible, I added a helper function for getting an array of rows from an array of matrices. `test_linear_algebra.cpp` updated to test functionality.

Conversion of State to SoA

4c5156b

- Results match to full precision with master for manufactured solution
- Unit tests updated accordingly

Moving from ColXpr to Map for Row representation

86b59bf

Setting row major storage layout as default for eigen

a6396bc

Merge branch 'SoA' of github.com:bremerm31/dgswemv2 into SoA

5a20e35

Adding blaze compatibility

4fa2399

Various discrepancies between Eigen and Blaze rows require workarounds here.

X

ba41ca0

Placing Elements into opaque containers

94d3113

This is the first commit to move the entire hierarchy towards something where we have containers that contain both data in a struct of arrays layout and accessors.

Fixing broken unit tests

def2628

- Adding missing function definitions for 1D legendre basis

Flipping integration order on EX side of interface.

2d35e87

By removing this flipping, we will aim to vectorize more of the interface kernel evaluations, and avoid awkward memory loads and stores.

Refactor LLF flux for vectorization

649bde4

This requires significant changes to the LLF flux API and, as a result, to the files that use it.

Updating test_llf_flux to test all fluxes simultaneously

6ca68da

Vectorized LLF flux evaluations

7b1cec3

Removing temporaries associated with normal vectors

a76d0a3

Unrolling variable loops in interface kernel

94a11f3

Computing more thread_local temporaries in LLF flux

334b92d

Minor fixes

beafa56

- `thread_local` was not playing nicely with the `double` instantiation of `LLF_flux`. Simply replaced with an overload
- Error in accessing surface normals in `test/test_boundary_interface.cpp`

Adding is_vectorized type trait

ed029c2

`is_vectorized<T>` returns `std::true_type` if `T` has a constexpr boolean member `is_vectorized` that is set to true. Otherwise, `is_vectorized<T>` returns `std::false_type`.

Adding test_is_vectorized.cpp

d2db177

Re-adding temporaries

43be065

- L2 Errors match now

Making q_at_gp in interface kernel thread_local temporary

308b398

Revert "Making q_at_gp in interface kernel thread_local temporary"

bbfd512

This reverts commit 308b398.

Revert "Re-adding temporaries"

c8df987

This reverts commit 43be065.

Starting to store more data in SoA and simply giving the element acce…

2417c09

…ssors - Starting to update how integration is done.

bremerm31 and others added 26 commits December 14, 2018 15:10

Vectorizing Volume kernel first attempt

efbbbd6

Some weirdness in the use of block matrices, where the expression template requires explicit allocation or else it segfaults.

Reordering Data Layout using update storage orders

21dedb0

- Developed dg-micro-benchmarks, which have been used to find optimal data storage layouts
- This has sped up the volume kernel by a factor of 3 relative to the baseline

SoA-ifying interfaces and boundaries

7d46279

Adding container and vestiges of SoA class to mesh. Mesh types updated accordingly

Updating is_vectorized

dab1289

- `is_vectorized` in functor is now a `constexpr` function, which accepts a template argument. This allows us to dispatch function calls based on Element or interface type.

X

9e31d83

- Adding missing interface data structure

Adding absolute value work around for gcc

d9dade0

GCC doesn't correctly implement `_mm512_abs_pd` as of gcc/8.1.0, so we are using `max(a,-a)` in place of the intrinsic to get around this.

OMPI working

f4c5e42

Updated distributed boundaries with new SoA dispatch system. Numerically verified for OMPI runs

Starting to port vectorized code to HPX

f1187f6

Updating blaze options

d3c6984

Storing diagonal matrices as sparse matrices

1a0910d

Turning internal local matrices into DynMatrices

249f6f3

- using `DynMatrices` causes stack overflows for HPX

Adding hash table for assembly of scatter/gather matrices

afabdc7

- this should replace a complexity of O(n^2) with O(n) (on average)
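The commit does not spell out what is being looked up, so the following is only a generic sketch of the idea: replacing a linear search per lookup with a hash map built once. `build_index` and `gather_columns` are hypothetical names, and treating the looked-up position as a matrix column is an assumption made for this example.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Build an ID -> position map once; average cost is O(n) for n IDs.
inline std::unordered_map<unsigned int, std::size_t>
build_index(const std::vector<unsigned int>& element_ids) {
    std::unordered_map<unsigned int, std::size_t> index;
    index.reserve(element_ids.size());
    for (std::size_t i = 0; i < element_ids.size(); ++i) {
        index.emplace(element_ids[i], i);
    }
    return index;
}

// Resolve each requested ID to its position with an O(1) average lookup,
// instead of scanning `element_ids` linearly for every request.
inline std::vector<std::size_t>
gather_columns(const std::vector<unsigned int>& element_ids,
               const std::vector<unsigned int>& requested_ids) {
    const auto index = build_index(element_ids);
    std::vector<std::size_t> columns;
    columns.reserve(requested_ids.size());
    for (unsigned int id : requested_ids) {
        // at() throws std::out_of_range for an ID that was never registered.
        columns.push_back(index.at(id));
    }
    return columns;
}
```

With a linear scan, each of the n lookups costs O(n), giving O(n^2) overall; with the map, the build and all lookups together are O(n) on average.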

Placing barrier between initialization and timestepping

d9645a3

Reducing memory usage by removing temporaries

0968ade

Only requesting MPI_SINGLE_THREAD if omp_max_num_threads=0

22bdf56

Done to potentially avoid overhead associated with multithreaded MPI

Minor fixes in partitioner

7a68320

Merge branch 'master' into SoA

6fec93d

Updating linear algebra compatibilities

0859b26

Merge branch 'master' of github.com:UT-CHG/dgswemv2 into SoA

065876e

Porting EHDG-SWE to SoA Layout

5ded46c

Starting to refactor code to support SoA layout

- Note that there remain errors with solving linear systems.
- Will need to cherry pick bc9d45a

Fixing solving of linear systems

6b861b0

When solving linear systems using Blaze, the matrix must be in column-major order; otherwise, Blaze will solve `A^T X = B`.

- Add overload to solve systems with row major storage
- Explicitly make `delta_hat_global` column major for performance

bremerm31 added 3 commits May 14, 2019 12:08

Distributed EHDG works

64ce026

Both OMPI and HPX parallelizations match the serial implementation to full numerical precision.

Adding scaling columns and rows functionality

bfd653c

Intermediate IHDG Commit

0c7f00c

Code compiles; however, we have not been able to verify correctness. As a note to self, potential areas of concern remain:

- use of gp_ex
