Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Changing data held in element to accessor data #129

Open
wants to merge 65 commits into
base: master
Choose a base branch
from

Conversation

bremerm31
Copy link
Contributor

To optimize cache usage and partially enable vectorization, this PR converts the data from an array of structs layout (AoS) to a struct of arrays layout (SoA). To maintain the ease of use associated with the AoS layout, we are introducing an Accessor class, which will serve as a reference to the data associated with a given element.

bremerm31 and others added 30 commits November 27, 2018 12:03
Adding type trait to detect whether a class is a struct of arrays container.
Such a class must have
 1. A typdef name `Accessor`
 2. A member function `at()`, which accepts an `unsigned int` and returns an `Accessor`
Adding a function to extract members at each `index`.
In the case of a vector of array of structs this will return Accessors
for each array of structs
- Adding missing include in `use_blaze.hpp`
We needed to add support for arrays of struct of arrays. Since
accessors tend not to be default constructible `index_sequence`s to
enable default construction.

Various allocator issues were fixed for the vectors of vectors and
structs of arrays.

Updated `test_at_each.cpp` accordingly.
Additionally, since rows are not default constructible, I added a
helper function for getting an array of rows from an array of
matrices.

`test_linear_algebra.cpp` updated to test functionality.
- Results match to full precision with master for manufactured
  solution
- Unit tests updated accordingly
Various discrepancies between eigen and blaze rows require work
arounds here.
This is the first commit to move the entire hierarchy towards
something where we have containers that contain both data in a struct
of arrays layout and accessors.
In order for an efficient API, we need both the problem data, and
element data struct of arrays to be accessible in one struct. This
will imply that we will set up a second class hierarchy mimicing the
current element mesh hierarchy. From here, vectorized functions will
be called directly on the SoAs.
- Adding missing function definitions for 1D legendre basis
By removing this flipping, we will aim to vectorize more of the interface kernel evaluations, and avoid awkward memory loads and stores.
This requires significant changes to LLF flux API and resultingly
changed files.
- thread_local was not playing nicely with the `double` instantiation
  of `LLF_flux`. Simply replaced with an overload
- Error in accessing surface normals in `test/test_boundary_interface.cpp`
`is_vectorized<T>` returns `std::true_type` if `T` has a constexpr boolean
member `is_vectorized` that is set to true. Otherwise, is_vectorized
returns false type.
- Unitialized values were causing valgrind to come up not clean in `rkdg_swe_data_state.hpp`
- Error in Runge-Kutta update, solutions were being swapped in the
  state variable loop
- `thread_local` duration specifier was also causing issues. Removing
   for now. Will test performance on skylake nodes
- L2 Errors match now
…ssors

- Starting to update how integration is done.
bremerm31 and others added 26 commits December 14, 2018 15:10
Some weirdness in using of block matrices, where the expression
template requires explicit allocation or else it segfaults.
- Developed dg-micro-benchmarks which have been used to find optimal data storage layouts
- This has sped up the volume kernel by a factor of 3x from the baseline
Adding container and vestiges of SoA class to mesh.

Mesh types updated accordingly
- Each interface container will have a pointer to the
  `ElementContainers` for here using for all functionality we will be
  able to generate the sparse matrices for computing UgpIn and UgpEx
  as well as the integrations
- `ElementSoA` was updated so that `BoundaryData` is now
  included (various reserve functions required modification)
- `is_vectorized` in functor is now a `constexpr` function, which
  accepts a template argument. This allows us to dispatch function
  calls based on Element or interface type.
This is the beginning of the vectorization of the interface kernel. We
have outlined the join SoA/Accessor data type, and slightly modified
the interface kernel. Difficulties are arising due to strong typing of
elements requires the entire element SoA to be subsumed into the
pointer class.
- First correct implementation of vectorized interface kernel
- System implements a similar dispatch system based on vectorized
  function objects
- Three outstanding fixmes:
  1. ComputeUgpBdry and integration routines need to only occur once
     This may also allow for partial vectorization of non-vectorized
     interfaces (and potentially boundaries as well.
  2. FMAs need to be incorporated into numerical flux
  3. shrinkToFit() in interface kernel is causing memory leaks
X
- Adding missing interface data structure
Gcc doesn't correctly implement `_mm512_abs_pd`  as of gcc/8.1.0 and we are thus using `max(a,-a)` as the intrinsic to get around this.
- To allow for optimal memory use/utilizing vectorization all data layouts relating to the interface kernel have been moved to a Column Major layout
- This gives an additional speed-up of 1.5x for the interface kernel
- Total speed-up relative to last commit is 1.24
- Total speed-up relative to baseline (point of forking) is 2.12
Updated distributed boundaries with new SoA dispatch system.
Numerically verified for OMPI runs
- using `DynMatrices` causes stack overflows for HPX
- this should replace a complexity of O(n^2) with O(n) (on average)
Done to potentially avoid overhead associated with multithreaded MPI
Initializing sparse matrices was being done very inefficiently. To rememdy this we changed initializations of all `CompressedMatrices`. This included sections in `element_soa.hpp` and `interface_soa.hpp`. Initialization appears to be scaling linearly on the number of elements based on studies run.

| number of elements | Time (in seconds) |
|--------------------|-------------------|
|      65k           |      1.9          |
|     262k           |      8.1          |
|    1049k           |     32.0          |
Starting to refactor code support SoA layout
- note that there remain errors with solving linear systems.
- Will need to cherry pick bc9d45a
When solving linear systems using blaze, the matrix must be column
order otherwise. Blaze with solve `A^T X = B`.
- Add overload to solve systems with row major storage
- Explicitly make `delta_hat_global` column major for performance
@bremerm31
Copy link
Contributor Author

With 6b861b0, EHDG-SWE works in serial.

Both OMPI and HPX parallelizations match the serial implementation to
full numerical precision.
Code compiles, however we have not been able to verify correctness.
As a note to self, potential areas of concern remain:
- use of gp_ex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant