Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][Parallelization] Making explicitily schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils #12923

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

loumalouomega
Copy link
Member

@loumalouomega loumalouomega commented Dec 11, 2024

📝 Description

Making explicitily schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils. I still need to add a benchmark and actually compare that is faster. Also updates the bannerwith the parallelism information:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 10.1."0"-core/explicit-schedule-parallel-utili-d7754dadfa-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with Clang-14.0
Compiled with threading and MPI support. Threading support with OpenMP, scheduling dynamic.
Maximum number of threads: 20.
Running without MPI.
  • Add benchmark
  • Compare results

Fixes #12924

🆕 Changelog

@loumalouomega loumalouomega added Kratos Core Performance Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads labels Dec 11, 2024
@loumalouomega loumalouomega changed the title [Core] Making explicitily schedule(dynamic) by default in OMP loops in ParallelUtils [Core][Parallelization] Making explicitily schedule(dynamic) by default in OMP loops in ParallelUtils Dec 11, 2024
…ith dynamic schedule without conflicting the GIL
@RiccardoRossi
Copy link
Member

are u sure this is needed? because this is c++ code, i don't think the gil presents a problem here

@loumalouomega
Copy link
Member Author

are u sure this is needed? because this is c++ code, i don't think the gil presents a problem here

Look at https://github.com/KratosMultiphysics/Kratos/actions/runs/12273173829/job/34243450170

@loumalouomega
Copy link
Member Author

loumalouomega commented Dec 11, 2024

are u sure this is needed? because this is c++ code, i don't think the gil presents a problem here

Look at KratosMultiphysics/Kratos/actions/runs/12273173829/job/34243450170

And now is failing when running tests: https://github.com/KratosMultiphysics/Kratos/actions/runs/12275201329/job/34250231555?pr=12923. I will define in CMake

@RiccardoRossi
Copy link
Member

@loumalouomega dynamic scheduling is used today for example in the builder and solver....without the need of releasing the GIL

why is that different?

@loumalouomega
Copy link
Member Author

@loumalouomega dynamic scheduling is used today for example in the builder and solver....without the need of releasing the GIL

why is that different?

No idea, look at the outcome from the CI. We tested for some functions and the improvement is significant. This was added in a recent version of pybind11. pybind/pybind11#4246

@loumalouomega
Copy link
Member Author

Okay, looks like the last change fixed the issue

@loumalouomega loumalouomega marked this pull request as ready for review December 11, 2024 14:53
@loumalouomega loumalouomega requested a review from a team as a code owner December 11, 2024 14:53
@loumalouomega
Copy link
Member Author

@RiccardoRossi we can set it on runtime with this: https://www.openmp.org/spec-html/5.0/openmpse49.html and keep the current code and set the OMP_SCHEDULE by default to "dynamic"

@loumalouomega
Copy link
Member Author

Modified to be on runtime, defaulted to dynamic

@loumalouomega loumalouomega changed the title [Core][Parallelization] Making explicitily schedule(dynamic) by default in OMP loops in ParallelUtils [Core][Parallelization] Making explicitily schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils Dec 12, 2024
@loumalouomega
Copy link
Member Author

Okay, looks like the runtime works

@RiccardoRossi
Copy link
Member

Right now if you have 4 tareas and 1000 items, you will do 250 on each...definitely suboptimal.for dyna.ic scheduling...

@loumalouomega
Copy link
Member Author

loumalouomega commented Dec 12, 2024

Right now if you have 4 tareas and 1000 items, you will do 250 on each...definitely suboptimal.for dyna.ic scheduling...

The default is dynamic, not dynamic, 4. dynamic,4 is just an example, not the actual default. Anyway i am seeing that is not taking properly the environment variable.

@loumalouomega
Copy link
Member Author

Right now if you have 4 tareas and 1000 items, you will do 250 on each...definitely suboptimal.for dyna.ic scheduling...

The default is dynamic, not dynamic, 4. dynamic,4 is just an example, not the actual default. Anyway i am seeing that is not taking properly the environment variable.

Okay fixed that issue, BTW, now the the banner includes the parallelism information:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 10.1."0"-core/explicit-schedule-parallel-utili-d7754dadfa-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with Clang-14.0
Compiled with threading and MPI support. Threading support with OpenMP, scheduling dynamic.
Maximum number of threads: 20.
Running without MPI.

@@ -206,7 +206,7 @@ class BlockPartition
KRATOS_PREPARE_CATCH_THREAD_EXCEPTION

TReducer global_reducer;
#pragma omp parallel for
#pragma omp parallel for schedule(runtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loumalouomega as i am telling, take a look to line 154. It does not make sense to change this unless we change what happens there.

also to my understanding the runtime behaviour has potentially a very high overhead due to the need of making a syscall to fetch an env variable.

https://stackoverflow.com/questions/7460552/reading-environment-variables-is-slow-operation/7460612#7460612

not sure if that matters...but at least we need to beware of this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the benchmark to check that it affects significantly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of the runtime is to give flexibility, if you prefer we can define it on compiling time...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loumalouomega aside of the comments on the opportunity of using the OMP_SCHEDULE did u take a look at what i am writing?

we are doing "by hand" the chunking. If we don't change that, it makes no sense to use a different scheduling, as everyone will be working on its chunk (as of now we dot have more chunks than threads!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay...let me think this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Centrantly there is no effect (significative), I need to rethink this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we may need to rethink the chunging (to be dependent of the CPU architecture)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RiccardoRossi what do you suggest exactly, because I has been studying this and our chunking conflicts with the OMP scheduling, and a priori the most efficient would be to let OMP to do the chunking. The problem is that with that we lose the parallel_utilities design and reduction utilities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here there is a description of the scheduling options.

https://hpc.dmi.unibas.ch/wp-content/uploads/sites/87/2019/11/2018_patrick_buder_ma_thesisjanuary2018.pdf

we could just replicate the one of choice

# Check if the environment variable OMP_SCHEDULE is defined
if(DEFINED ENV{OMP_SCHEDULE})
# Set the already defined one
set(KRATOS_OMP_SCHEDULE $ENV{OMP_SCHEDULE})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMP_SCHEDULE is a runtime env variable, it is a extremely bad idea to use it a compilation switch (IMO).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but this is the following.

During compilation the OMP_SCHEDULE will set KRATOS_OMP_SCHEDULE that will be used as default if actually OMP_SCHEDULE is not defined, but if OMP_SCHEDULE is defined OMP_SCHEDULE will be taken into account. Do you understand me?

@pooyan-dadvand
Copy link
Member

I agree with chunk size argument by @RiccardoRossi

My point (in #12924) was to first give a way to define dynamic scheduling in our for each loop. This would let us to fine tune our parallelization in many cases that dynamic would be better or at least not worst.

For having the dynamic as default now I understand that would not work and chunk size would be an important blocker....

@RiccardoRossi
Copy link
Member

I agree with chunk size argument by @RiccardoRossi

My point (in #12924) was to first give a way to define dynamic scheduling in our for each loop. This would let us to fine tune our parallelization in many cases that dynamic would be better or at least not worst.

For having the dynamic as default now I understand that would not work and chunk size would be an important blocker....

to clarify, it is NOT difficult to change the chunking algorithm (i guess it will be a 20 lines long code), i am simply telling that it needs to be done aside of the other changes.

@loumalouomega loumalouomega marked this pull request as draft December 18, 2024 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kratos Core Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core][Parallelization] Shall we change our parallel utils to use dynamic scheduling instead of static?
4 participants