Speeding Up FDTD Engine By an Improved Multi-dimensional Arrays Implementation "ArrayLib" with Contiguous Memory #161

Merged (16 commits) on Dec 29, 2024

Conversation

@biergaizi (Contributor) commented Nov 22, 2024

TL;DR

Motivation

This is a resubmitted version of PR #117, reworked for the third time. The difference is that, this time, a small C++ library named ArrayLib is implemented, and openEMS's FDTD engine is converted to use the new ArrayLib code.

In ArrayLib, by using C++ template metaprogramming, the existing array indexing syntax arr[n][i][j][k] can be kept without changing it to arr(n, i, j, k). This way, we achieve two goals:

  1. It clears the path for the upcoming Tiling engine, which is also based on the new array implementation.
  2. It minimizes the changes to the existing engine code. This is useful because the upcoming Tiling engine code is experimental, so we want to keep the old one for reference.

Please consider merging this change as soon as it's appropriate to do so, so that the next phase of development can start. The best is yet to come.

Result

Correctness

All built-in Python examples were executed and the simulation results are exactly the same. Furthermore, a third-party pyems simulation called gcpw.py was run for hours, its field dumps were compared with those from the unmodified openEMS code, and all floating-point numbers are bit-identical.

Note that openEMS automatically terminates a simulation once the energy drops below the end criteria, but the exact timestep of termination depends on simulation speed, so the same simulation is not reproducible by default. This is not a bug in this PR. To replicate a simulation result exactly, it's necessary to (1) set an impossibly low end criteria such as 1e-9, so the simulation is not stopped early, and (2) manually specify the NrTS of a simulation, so that the same simulation always stops at the same timestep.

Performance

Performance is improved by 5%-30% in medium and large-size simulations. The most extreme improvement (30%) is seen in long and narrow structures. This is expected because the old array code performs worst when the last stride is non-contiguous, which is exactly what the new code is meant to fix. For small simulations there can be a performance degradation of around 4%-5%, but given their small sizes these slowdowns are negligible.

Since the results are rather similar to the benchmarks shown in the old PR, only the benchmark results from one machine are shown here:

Zen 2 Benchmark


ArrayLib - Simple Multi-Dimensional Array Library with C-like Syntax

Motivation

Originally, multi-dimensional arrays in the simulation engine were allocated as multi-level C pointers. For example, the following code allocates and accesses a 3D array:

// allocate a 3D array as pointers to pointers
T*** array = NULL;
array = new T**[SIZE_I];
for (size_t i = 0; i < SIZE_I; i++)
{
    array[i] = new T*[SIZE_J];
    for (size_t j = 0; j < SIZE_J; j++)
    {
        array[i][j] = new T[SIZE_K];
        for (size_t k = 0; k < SIZE_K; k++)
        {
            // access array via C-style syntax
            array[i][j][k] = 0;
        }
    }
}

The advantage is that all multi-dimensional arrays can be accessed like C-style arrays, but this approach has several downsides:

  1. A single array requires multiple malloc() or new allocations.
  2. The allocated memory is not contiguous, reducing memory locality and performance, especially when the final dimension is small (see the contiguous sketch below).
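
For comparison, here is a minimal sketch (not ArrayLib itself) of the contiguous layout that avoids both problems: a single allocation plus explicit index arithmetic:

// one contiguous block; element (i, j, k) lives at offset (i*SIZE_J + j)*SIZE_K + k
T* array = new T[SIZE_I * SIZE_J * SIZE_K]();

// access element (i, j, k) via explicit index arithmetic
array[(i * SIZE_J + j) * SIZE_K + k] = 0;

delete[] array;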

According to today's best practice (e.g. std::mdspan), these raw pointers should be converted to wrapper array classes with internal indexing code, and they should be accessed using the following syntax:

array(i, j, k) = 0;
array[i, j, k] = 0; // C++23
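
For illustration only (this PR does not use std::mdspan), a minimal self-contained C++23 sketch of that style might look like:

#include <mdspan>   // C++23
#include <vector>

void example()
{
    std::vector<float> storage(10 * 20 * 30);
    // non-owning 3D view over contiguous memory
    std::mdspan view(storage.data(), 10, 20, 30);
    view[1, 2, 3] = 0.0f;   // C++23 multidimensional operator[]
}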

But this requires changing all of the existing array indexing code throughout the codebase, which is error-prone and difficult to review.

ArrayLib is a small library written with the goal of providing a simple implementation of multi-dimensional arrays while preserving the C-like access syntax via C++ template metaprogramming, minimizing the code changes in the simulation engine.

Supported Types

  1. ArrayIJ - 2D array.
  2. ArrayIJK - 3D array.
  3. ArrayNIJK - 4D array for 3D vector fields (N = 3 only): IJK is the coordinate in 3D space, and N (0, 1, 2) selects the X/Y/Z value of the vector at that coordinate.

To avoid ambiguity, IJK is always used to describe coordinates in space, not X/Y/Z.
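
For example, a hypothetical snippet (the name and grid size are made up, and the constructor is assumed to mirror the ArrayIJK examples below):

// 3-component vector field on an assumed 10x20x30 grid
ArrayLib::ArrayNIJK<float> field("field", {10, 20, 30});

field[0][1][2][3] = 1.0f;  // X value of the vector at coordinate (i=1, j=2, k=3)
field[2][1][2][3] = 0.0f;  // Z value of the vector at the same coordinate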

Usage 1: new/delete

ArrayLib can be used in several ways. The first method closely mirrors the C-style code using raw pointers, allowing direct migration from legacy code.

ArrayLib::ArrayIJK<float>* arr_ptr = new ArrayLib::ArrayIJK<float>("name", {10, 20, 30});

// arr_ptr[0][1][2] is wrong, because we need to call the operator[]
// of ArrayIJK, rather than indexing an array of pointers of type
// `ArrayIJK*`.
(*arr_ptr)[0][1][2] = 3;

// But one can use a C++ reference rather than a pointer to access it:
// one extra line of boilerplate, but much more readable
ArrayLib::ArrayIJK<float>& arr = *arr_ptr;
arr[0][1][2] = 3;

delete arr_ptr;

The first method is advantageous in the sense that all existing pointers with inconsistent lifetimes can be converted to ArrayLib directly as a drop-in replacement. The disadvantage is that the C-style syntax is only emulated when we're accessing a C++ reference, not a C pointer. Since a C++ reference can't be NULL, when arrays with inconsistent lifetimes are used in classes, C-style pointers (with the awkward dereferencing syntax) must be kept as member variables rather than references (with the natural syntax).

To write readable code, whenever an array is accessed in a member function, we must first dereference the pointer to obtain a C++ reference. Overall, this is the smoothest migration path, so this style is widely used in the engine classes.
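
For example, a hypothetical member function using this pattern could look like the following (the class and member names are illustrative, not the actual openEMS code):

class SomeEngine
{
    // lifetime managed elsewhere, so a raw pointer member is kept
    ArrayLib::ArrayNIJK<float>* volt_ptr;

public:
    void AddVoltage(unsigned int n, unsigned int i, unsigned int j,
                    unsigned int k, float value)
    {
        // dereference once into a reference, then index naturally
        ArrayLib::ArrayNIJK<float>& volt = *volt_ptr;
        volt[n][i][j][k] += value;
    }
};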

Usage 2: As Local Variable w/ RAII

ArrayLib arrays can be used as local variables by declaring them directly (with a name and size). These arrays are automatically freed by C++'s RAII mechanism when they go out of scope.

void process()
{
    ArrayLib::ArrayIJK<float> arr("name", {10, 20, 30});
    arr[0][1][2] = 0;
    // deleted automatically by RAII
}

This allows one to access arrays using the natural syntax, without manual memory management.

Usage 3: As Class Member Variable w/ RAII

Similarly, ArrayLib arrays can also be used as class member variables in conformance with the C++ RAII idiom; member arrays with runtime-determined sizes can be initialized in the constructor's member initializer list.

class Engine
{
public:
    // automatically deleted by RAII
    ArrayLib::ArrayIJK<float> arr1{"static size", {10, 20, 30}};
    ArrayLib::ArrayIJK<float> arr2;

    Engine(size_t i, size_t j, size_t k) :
        arr2("initialize in constructor", {i, j, k})
    {
        // use natural C-style syntax for access
        arr1[1][2][3] = 0;
        arr2[4][5][6] = 1;
    }
};

Ideally, this should be the recommended usage for ArrayLib arrays in C++ classes.

Unfortunately, because the existing codebase does not follow the C++ RAII idiom, and arrays mostly have inconsistent lifetimes (e.g. sometimes an array's size can't be determined in the constructor, but is instead determined in the middle of the class's lifetime from an external source), it's not possible to translate all existing array usages to this idiom.

Usage 4: As Class Member Variable w/ Two-Phase Initialization

As a compromise for the lifetime problem, two-phase initialization is also supported. This is done by creating an array object (not a reference or pointer) as a class member variable without specifying its name and size. The result is a placeholder array in an invalid state, which can be initialized later.

class Engine
{
public:
    // automatically deleted by RAII, size unknown for now
    ArrayLib::ArrayIJK<float> arr1;

    Engine(size_t i, size_t j, size_t k)
    {
        // two-phase initialization before use
        arr1.Init("name", {i, j, k});

        // use natural C-style syntax for access
        arr1[1][2][3] = 0;
    }
};

Although it's potentially unsafe, one can still enjoy automatic release of memory upon class destruction via C++'s RAII.

ArrayLib Arrays Can't be Passed by Value

ArrayLib objects are thin wrappers around heap-allocated memory, with few extra features compared to raw C pointers. In fact, they're designed to serve as drop-in replacements for the raw pointers prevalent in the existing codebase (which we expect to refactor progressively), so they're not meant to be high-level C++ containers. ArrayLib objects must always be passed as pointers or references; pass-by-value and copy-assignment are not supported. These operations are marked as deleted, and attempting to use them produces a compile-time error. They make little sense for the low-level numerical simulation kernel in its current form.

ArrayLib::ArrayIJK<float> data("name", {10, 20, 30});

// pass-by-reference, recommended when possible
void process_ref(ArrayLib::ArrayIJK<float>& arr);
process_ref(data);

// pass-by-pointer, use only when necessary
void process_ptr(ArrayLib::ArrayIJK<float>* arr_ptr);
process_ptr(&data);

// pass-by-value, WRONG! fails to compile because copying is deleted
void process_value(ArrayLib::ArrayIJK<float> arr);
process_value(data);

Furthermore, practically speaking, the large arrays representing electromagnetic fields and material properties in a simulation are allocated and passed around as pointers; attempting to pass such an array by value is always a bug.
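
A minimal sketch of how such a prohibition is typically expressed in C++ (the class name is hypothetical, not the literal ArrayLib declarations):

template <typename T>
class SomeArray
{
public:
    // copying is explicitly disabled, so arrays can only be
    // passed by reference or by pointer
    SomeArray(const SomeArray&) = delete;
    SomeArray& operator=(const SomeArray&) = delete;

    // ... constructors, operator[], data(), etc.
};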

@thliebig (Owner) commented Dec 9, 2024

It minimizes the changes to the existing engine code. This is useful because the upcoming Tiling engine code is experimental, so we want to keep the old one for reference.

But this requires changing all of the existing array indexing code throughout the codebase, which is error-prone and difficult to review.

I did not yet have the time to check out these massive code changes. But from these comments, it sounds like this is a compromise to avoid changing too much, by using a wrapper class?

What would be the downside of "going all the way at once"? Would we still need this or a wrapper? Or could C++23 handle it all? Would it really be that bad? After all, the engine interfaces already supply simple access anywhere outside the engine?

Would it not be possible inside the engine itself to use simple linear array access (pointer math) without any need for a wrapper?
Well, I guess I have to dig much deeper to understand all this...

@thliebig (Owner) commented Dec 9, 2024

Another question I have is: why did you also change the basic/simple Operator & Engine? Why not keep it as simple and basic as it gets, even if that is known to be slower, and only do these advancements for the SSE + multi-threading operators & engines?
My basic idea has always been to keep a simple and easy-to-understand baseline engine.

@thliebig (Owner) commented Dec 9, 2024

Would/have you considered "just" creating an all-new engine? At first using only your new interface, and eventually maybe using AVX and your new tiling? An argument against this would be that we already have too many operators & engines...

@biergaizi (Contributor, Author)

I did not yet have the time to check out these massive code changes.

I don't think it's a massive change. The old pull request was the truly massive one. For example, compare the changes in FDTD/engine_sse_compressed.cpp: the old patch had 90 insertions and 84 deletions, while the new patch has only 6 insertions. That reduces the size of the change by roughly a factor of 30.

What would be the downside of "going all the way at once"? Would we still need this or a wrapper? Or could C++23 handle it all? Would it really be that bad?

The old pull request did it that way: it changed all array syntax to the new-style operator() syntax throughout the codebase. The problem is that this increases the length of the .diff dramatically. But if you think the nested operator[] trick is only a stop-gap, I can do it the other way.

Another question I have is: why did you also change the basic/simple Operator & Engine? Why not keep it as simple and basic as it gets, even if that is known to be slower, and only do these advancements for the SSE + multi-threading operators & engines?

Three reasons. The first is that the Operator base class handles material properties and field dumps, so this part can limit the performance of all engines. The second reason is that my ultimate goal is to remove the old array implementation array_ops.cpp - after all the Engine Extensions are converted, the Basic engine will be the only user of this obsolete array code...

Would/have you considered "just" creating an all-new engine? At first using only your new interface, and eventually maybe using AVX and your new tiling? An argument against this would be that we already have too many operators & engines...

Fully replacing the current multi-threading engine is problematic:

  1. The Tiling engine is very sensitive to the simulation domain size, and it's not always faster.
  2. The Tiling engine requires new dependencies such as Intel TBB, which is used to handle workload balancing. This implementation may even change in the future.
  3. It's not compatible with all extensions, and so far I'm not sure how all of these problems can be solved.
  4. I've experimented with AVX on at least 3 different occasions; my conclusion is that it has basically no benefit. Memory access patterns are much more important than AVX. For example, by making vectors longer, you also change the granularity of operator compression. Everything interacts in ways that are difficult to understand.

So it's in no sense a drop-in replacement; it's only an opt-in option for users who can benefit from it.

One idea, perhaps my long-term plan, is to reimplement FDTD entirely as a shared library, with the high-performance engine existing fully independently of the existing one. This way, many different engines can be added, such as a GPU engine, an MPI engine, a new Tiling engine, etc., and the new high-performance engine would not be limited by the structure of the existing classes. At the same time, there's minimal disruption to existing users.

@thliebig (Owner) commented Dec 26, 2024

Question: Doesn't any access to e.g. the 4D array cause a linearIndex call with multiple multiplications and additions? (I'm aware that this was also the case before.) Now that we have a nice linear array in memory, would it not make sense for the engine to calculate the linear index once and reuse that index, e.g. inside the engine loops?
But I guess this would just be a new optimization approach, and that's not what this PR wants to solve for now.
My primary question is: would such a direct "access" into the linear array be possible (and make sense?), and if so, how?

Could there be a function to retrieve the linear array directly and access it directly?
Is it still possible to use the arr(n, i, j, k) with this ArrayLib class?

@biergaizi (Contributor, Author) commented Dec 27, 2024

Question: Doesn't any access to e.g. the 4D array cause a linearIndex call with multiple multiplications and additions? (I'm aware that this was also the case before.) Now that we have a nice linear array in memory, would it not make sense for the engine to calculate the linear index once and reuse that index, e.g. inside the engine loops?

My opinion is simple. The current implementation may not be the most optimal one, but it has already demonstrated a performance improvement based on measurement data. If anyone (a contributor, or possibly myself in the future) believes performance can be improved merely by reusing indices, they are free to implement this optimization, but it must be justified with as much data as I have provided here. Personally, from my tests, I don't see any performance benefit from reusing pre-computed indices for now. The index computation has an overhead, but it's masked by memory access overhead anyway on most machines.

My primary question is: would such a direct "access" into the linear array be possible (and make sense?), and if so, how?

Yes, it's already possible now. Following the standard C++ convention, one can use the data() method of the base class to access the raw memory wrapped inside the object.

// return raw pointer to underlying data
T*                          data()              const { return m_ptr;      }
T&                          data(IndexType n)   const { return m_ptr[n];   }

Thus, using a precomputed index can be done via something like the following:

size_t index = array.linearIndex({n, i, j, k});
array.data(index) = 42;
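
For instance, a hot loop could hoist the index computation out of the innermost dimension. A hypothetical sketch (numK stands for the extent of the last dimension, and the last dimension is assumed to be contiguous with unit stride, which is the layout this PR establishes):

// compute the row's base index once per (n, i, j) ...
size_t base = array.linearIndex({n, i, j, 0});

// ... then walk the contiguous k dimension by stepping the linear index
for (unsigned int k = 0; k < numK; k++)
    array.data(base + k) = 0.0f;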

Is it still possible to use the arr(n, i, j, k) with this ArrayLib class?

Three kinds of syntax are supported:

array(n, i, j, k);
array[n][i][j][k];
array.data(idx);

The first and second styles have similar performance, since nested operator[] calls are optimized away by the C++ compiler at build time.
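
Conceptually, the nested syntax works by returning lightweight proxy objects that only carry a base address, which the optimizer flattens into a single address computation. A simplified 2D sketch of this technique (not the actual ArrayLib code):

#include <cstddef>

template <typename T>
class Array2D
{
    T* m_ptr;
    std::size_t m_strideI;  // elements per row

    struct RowProxy
    {
        T* row;
        T& operator[](std::size_t j) const { return row[j]; }
    };

public:
    Array2D(std::size_t sizeI, std::size_t sizeJ)
        : m_ptr(new T[sizeI * sizeJ]()), m_strideI(sizeJ) {}
    ~Array2D() { delete[] m_ptr; }

    // copying disabled in the sketch as well
    Array2D(const Array2D&) = delete;
    Array2D& operator=(const Array2D&) = delete;

    // arr[i] returns a proxy holding the row's base address, so
    // arr[i][j] compiles down to m_ptr[i * m_strideI + j]
    RowProxy operator[](std::size_t i) const
    {
        return RowProxy{m_ptr + i * m_strideI};
    }
};

With this pattern, arr[i][j] involves exactly one multiply-add, the same as an arr(i, j) overload would.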

@thliebig (Owner)

One more question, and this is just me trying to hopefully fully understand this new array approach soon:

The GetVV method could be written like this, right?

    inline virtual FDTD_FLOAT GetVV(unsigned int n, unsigned int x, unsigned int y, unsigned int z) const
    {
        return (*vv_ptr)(n,x,y,z);
    }

which would be a bit shorter and easier than the current:

    inline virtual FDTD_FLOAT GetVV(unsigned int n, unsigned int x, unsigned int y, unsigned int z) const
    {
        ArrayLib::ArrayNIJK<FDTD_FLOAT>& vv = *vv_ptr;
        return vv[n][x][y][z];
    }

BUT: no need to change, just me asking, testing and understanding...
Maybe someday we might switch to this shorter and neater version... maybe...

In any case, I think I'm done with my review. How should I merge this?
a.) Normal (no-ff) merge
b.) Rebase and merge
c.) Squash and merge

Personally, I think I would prefer c over b over a, because these are a lot of commits, it looks to me as if there was some back and forth, and I did the code review as a single change anyhow... But since I'm asking, I could obviously live with any of the three... Please let me know what you would prefer...

And thanks again for your work on this...

@biergaizi (Contributor, Author)

I have just rebased the pull request myself; it should now be possible to do a regular (fast-forward) merge.

@thliebig thliebig merged commit 078483d into thliebig:master Dec 29, 2024
12 checks passed
@KJ7LNW commented Dec 29, 2024

Great work!
