nvortexVc

Direct Biot-Savart solver for 2D and 3D vortex blobs accelerated with Vc

Build and run

Building is straightforward on a Linux host. To use the Vc-accelerated kernels, first build and install Vc:

git clone https://github.com/VcDevel/Vc.git
cd Vc
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/opt/Vc -DBUILD_TESTING=OFF ..
make -j 4
sudo make install

Now, back in this directory you can build the binaries with

mkdir build
cd build
cmake ..
make

To build with optimization, Vc, and/or OpenMP, replace the cmake command above with one of the following:

cmake -DCMAKE_BUILD_TYPE=Release ..
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_VC=TRUE -DVc_DIR=/opt/Vc/lib/cmake/Vc ..
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OMP=TRUE ..

Run the code with one argument: the number of particles to use.

./nvortex2d.bin -n=10000
./nvortex3d.bin -n=10000

Parallel runs

To generate a distributed-memory build using MPI, first make sure an MPI implementation is installed and visible in your environment. On Fedora this is typically accomplished with something like

sudo dnf install openmpi openmpi-devel
module load mpi

Then set USE_MPI=ON when running cmake (for example, -DUSE_MPI=ON) and generate the build. Run with something like

OMP_NUM_THREADS=1 mpirun -v -n 2 ./nvortex3d.bin -n=10000

This will run two processes, each with one OpenMP thread.

Performance

With Vc, this algorithm runs slightly faster than my SimdNBodyKernels code, which uses Intel's ispc compiler, and the code here is easier to write and understand. A little care must be taken to pad the particle arrays to a full vector width with safe values. In particular, the padded particle radius must be nonzero, or the inner kernel will try to divide by zero when it processes the last, partially filled vector of particles.

On a 16-core Intel i9-7960X Skylake CPU, with a large enough problem, the nvortex3domp code exceeded 1 TFlop/s, the first time I've ever seen that happen on a single CPU. The 16-core Ryzen 9 achieved 1133 GFlop/s with ./nvortex3domp -n=100000.

The table below gives peak measured performance for the 3D case (with no grads) for combinations of plain x86 code (possibly auto-vectorized) vs. Vc, and single-thread vs. OpenMP multithreaded. Each case was run on a Fedora or Ubuntu machine and compiled with GCC (7.2, 7.3, 9.3.1, or 11.2). All numbers are in GFlop/s and the problem size was 30000 particles. The last column is the Vc OMP result as a fraction of the CPU's theoretical peak. Note that the 5800X can perform 8 FMAs (a full 256-bit AVX2 register) on each of its two capable FPUs, per core, per clock. Incredible!

CPU (values in GFlop/s)	x86 serial	x86 OMP	Vc serial	Vc OMP	CPU peak	Vc OMP % of theoretical peak
Intel i3-8109U	5.2	11.5	69.6	145	230.4	63%
Intel i7-7500U	5.1	11.0	62.2	121	204.8	59%
Intel i7-5960X	4.3	34.0	44.0	369	768	48%
Intel i9-7960X	4.3	64.5	58.5	813	1434	57%
AMD Ryzen 7 2700X	4.5	47.3	42.9	289	499.2	58%
AMD Ryzen 9 3950X	6.2	62.3	86.6	790	2048	39%
AMD Ryzen 7 5800X	8.2	63.0	105.8	767	1208	64%

Note that the highest measured flop rate depends on problem size, number of cores, and cache sizes. For example, the AMD Ryzen 9 3950X can achieve 1154 GFlop/s (56% of theoretical peak) with nvortex3domp -n=150000.
