
[filters] The relationship between Convolution3D running speed and number of threads #6131

Closed
Ru1yi opened this issue Sep 11, 2024 · 16 comments
Ru1yi commented Sep 11, 2024

Describe the bug
I was trying to use the Gaussian convolution filter from the filters module. While experimenting with the number of threads, I found that the convolution runs fastest when setNumberOfThreads is set to 1, and that it gets slower the more threads I use. Below is my source code:

// Headers used by this snippet (PointT, PointCloudT, PointCloudPtr and cloudptr are defined elsewhere in my project)
#include <pcl/filters/convolution_3d.h>
#include <pcl/search/kdtree.h>
#include <QDebug>
#include <chrono>

auto st = std::chrono::high_resolution_clock::now();
// Set up the Gaussian kernel
pcl::filters::GaussianKernel<PointT, PointT> kernel;
kernel.setSigma(4);
kernel.setThresholdRelativeToSigma(4);
kernel.setThreshold(0.05);
// Create the kd-tree used for the neighbour searches
pcl::search::KdTree<PointT>::Ptr tree(new pcl::search::KdTree<PointT>);
tree->setInputCloud(cloudptr);
// Set convolution parameters
pcl::filters::Convolution3D<PointT, PointT, pcl::filters::GaussianKernel<PointT, PointT>> convolution;
convolution.setKernel(kernel);
convolution.setInputCloud(cloudptr);
convolution.setNumberOfThreads(1);
convolution.setSearchMethod(tree);
convolution.setRadiusSearch(0.01);
// Run the convolution and time it
PointCloudPtr g_filtered(new PointCloudT);
convolution.convolve(*g_filtered);
cloudptr = g_filtered;
auto et = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = et - st;
qDebug() << "[Filter] Gaussian filter time: " << duration.count() * 1000 << "ms";

Test Result

setNumberOfThreads    Average run time (ms)
1                     1350
2                     1700
4                     2200
8                     3100
default               7200

My Environment:

  • OS: Windows 11
  • IDE: VS2019
  • Compiler: MSVC2019
  • PCL Version 1.12.1
  • Original point cloud data: 128 (width) × 1800 (height), 230400 points in total

Possible Solution

It looks like the time spent on thread management exceeds the time spent on the algorithm itself. Are my parameter settings wrong, or can Convolution3D not be used for organized point clouds?

Ru1yi added the kind: bug and status: triage labels on Sep 11, 2024
mvieth (Member) commented Sep 11, 2024

@Ru1yi I did a quick test with a different cloud with GCC on Linux, and I got 5141ms - 4105ms - 2448ms - 1426ms (1 thread - 2 threads - 4 threads - 8 threads respectively).
How did you install PCL? Do you enable OpenMP while compiling your project? I don't have much experience with MSVC and OpenMP, but I do not think that thread management takes that long.
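
In case it is useful, here is a minimal sketch (plain C++, nothing PCL-specific) to check whether OpenMP is actually enabled in a build and how many threads it would use by default:

#include <cstdio>
#ifdef _OPENMP
#include <omp.h> // only available when the compiler has OpenMP enabled
#endif

int main()
{
#ifdef _OPENMP
  // _OPENMP encodes the supported OpenMP version as yyyymm
  std::printf("OpenMP enabled (_OPENMP=%d), max threads: %d, logical processors: %d\n",
              _OPENMP, omp_get_max_threads(), omp_get_num_procs());
#else
  std::printf("This binary was compiled without OpenMP support\n");
#endif
  return 0;
}

If _OPENMP is not defined, the parallel pragma in Convolution3D is ignored and setNumberOfThreads has no effect.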

Ru1yi (Author) commented Sep 12, 2024

I installed PCL using PCL-1.12.1-AllInOne-msvc2019-win64.exe from the official release. If I want to use PCL with OpenMP, do I need to configure anything specifically? I enabled OpenMP support in the VS project configuration.
Here are the runtimes when I turn off OpenMP support:

setNumberOfThreads    Average run time (ms)
1                     1350
2                     1330
4                     1350
8                     1380

It looks like OpenMP was turned off successfully. Could it be a problem with my data? Could you help me confirm MSVC2019's support for OpenMP? Thanks a lot.

mvieth (Member) commented Sep 12, 2024

@Ru1yi Can you post the point cloud you are using as a zipped PCD or PLY file? I will also try to test Convolution3D with MSVC as soon as I can.

Ru1yi (Author) commented Sep 13, 2024

@mvieth 3644.500049090.zip Here is the data I used for testing, which was collected using the Robosense 128-line mechanical lidar. Thank you for your support.

mvieth (Member) commented Sep 13, 2024

I did some testing (VS2022, PCL 1.13.0, with 3644.100042090.pcd), but I did not see the run time increase with more threads: 2 threads were always faster than 1 thread. I did notice that at some point, more threads did not make it any faster, maybe around 4 threads (even though my computer has 6 physical cores with 2 hyperthreads each). I only had to enable OpenMP support in one place in the project configuration (set to Yes (/openmp)). Do you build in the Debug or Release configuration? Do you run with or without debugging?

Ru1yi (Author) commented Sep 18, 2024

Because I develop with Qt5 and VS2019 together, I am on PCL 1.12.1. Using VS2022 with PCL 1.13.0 or above would require moving to Qt6; in my earlier tests PCL 1.12.1 did not work with MSVC2022, and migrating the whole program to Qt6 would be a lot of work, so I would like to know whether the version affects this problem. I build this project in the Debug configuration and run with debugging.

mvieth (Member) commented Sep 18, 2024

I don't think that PCL 1.12.1 vs PCL 1.13.0 or VS2019 vs VS2022 makes any difference for the multithreading performance. I would suggest also testing with your project built in the Release configuration and run without debugging. I could imagine that building in Debug and running under the debugger lead to more thread management overhead.

Ru1yi (Author) commented Sep 24, 2024

I tested the Release runtime in both debugging and non-debugging modes.

debugging mode

setNumberOfThreads    Average run time (ms)
1                     210
2                     180
4                     210
8                     300

non-debugging mode

setNumberOfThreads    Average run time (ms)
1                     190
2                     185
4                     220
8                     310

The running time is much shorter than in Debug mode, and the shortest running time is achieved with 2 threads. But I still don't see increasing the number of threads speeding up the algorithm beyond that.

mvieth (Member) commented Sep 24, 2024

@Ru1yi Okay, this is getting closer to the results I had on MSVC. At least now, using 2 threads is faster than using 1 thread, which is promising. Can you say which CPU you are testing on? (manufacturer, model number etc)

larshg (Contributor) commented Sep 25, 2024

I can verify that I get similar results with VS2022 and PCL master.
I tried out Intel VTune to profile the example code, and with 4 threads it looks like this:
[VTune profiling screenshot]

Seemingly a lot of time is spent allocating and deallocating the result vector inside FLANN - it creates a new array for the neighbors and allocates 1024 units (floats) of space on each query.

https://github.com/flann-lib/flann/blob/f9caaf609d8b8cb2b7104a85cf59eb92c275a25d/src/cpp/flann/algorithms/nn_index.h#L626

https://github.com/flann-lib/flann/blob/f9caaf609d8b8cb2b7104a85cf59eb92c275a25d/src/cpp/flann/util/result_set.h#L435
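
For reference, here is a rough sketch that times only the radius searches on the input cloud (assuming the same PointT and cloudptr as in the snippet at the top of the issue), to see how much of the total run time the neighbour queries and their per-query allocations account for:

#include <pcl/search/kdtree.h>
#include <chrono>
#include <iostream>

// Time the radius searches alone, reusing the output vectors across queries.
pcl::search::KdTree<PointT> tree;
tree.setInputCloud(cloudptr);
pcl::Indices nn_indices;
std::vector<float> nn_sqr_dists;
const auto t0 = std::chrono::high_resolution_clock::now();
for (const auto& point : *cloudptr)
  tree.radiusSearch(point, 0.01, nn_indices, nn_sqr_dists);
const auto t1 = std::chrono::high_resolution_clock::now();
std::cout << "radius searches only: "
          << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";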

Also, I tried to visualize your source cloud and the result of the convolution, and there is not much difference between them. I think you might want to increase your radius and perhaps adjust the other parameters as well.

edit: update link to correct line in first link.

larshg (Contributor) commented Sep 25, 2024

Oh, and as the number of threads increases, more time is spent on allocating/deallocating and waiting 😢

Ru1yi (Author) commented Oct 18, 2024

> @Ru1yi Okay, this is getting closer to the results I had on MSVC. At least now, using 2 threads is faster than using 1 thread, which is promising. Can you say which CPU you are testing on? (manufacturer, model number etc)

This is the CPU model I used for testing: Intel(R) Core(TM) i7-14700HX 2.10 GHz

mvieth (Member) commented Oct 18, 2024

> This is the CPU model I used for testing: Intel(R) Core(TM) i7-14700HX 2.10 GHz

That is good to know, because then your CPU is one of the new models that has "Performance-cores" and "Efficient-cores" (see https://ark.intel.com/content/www/us/en/ark/products/235997/intel-core-i7-processor-14700hx-33m-cache-up-to-5-50-ghz.html ). Basically, P-cores are faster but more power hungry while E-cores are slower but more energy efficient. In contrast, my CPU is an older one where all cores are the same.

Maybe on your computer there is some weird effect that if you run with one thread, a P-core is used, but if you run with more threads, E-cores are used.

Here is some information I found about OpenMP with P-cores/E-cores:

The consensus seems to be to choose an appropriate work scheduling to keep all cores busy (dynamic schedule). Convolution3D currently does not specify any schedule, which usually means a static schedule is chosen by the compiler (all threads are assigned the same number of points to process). In fact, a dynamic schedule might even be beneficial for Convolution3D on CPUs where all cores are the same (like mine), because the radius search does not take the same amount of time for each point. I will investigate and open a pull request when I have time.
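
For illustration, a minimal sketch of what a dynamically scheduled loop looks like (the chunk size and the names cloud and num_threads are placeholders here; this is not the actual Convolution3D code):

// Sketch only: with schedule(dynamic), iterations are handed out in chunks as
// threads become free, so one slow radius search does not stall a whole
// statically assigned block of points. 'cloud' and 'num_threads' are placeholders.
#pragma omp parallel for schedule(dynamic, 64) num_threads(num_threads)
for (int i = 0; i < static_cast<int>(cloud->size()); ++i)
{
  // radius search around point i, then apply the Gaussian kernel to the neighbours
}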

Ru1yi (Author) commented Oct 19, 2024


Thank you very much for your help. I will continue to pay attention to this issue when I have time.

mvieth (Member) commented Oct 19, 2024

> Also, I tried to visualize your source cloud and the result of the convolution, and there is not much difference between them. I think you might want to increase your radius and perhaps adjust the other parameters as well.

I want to emphasise this comment by Lars again: I just checked, and with the search radius setting from above (0.01) and the point clouds you uploaded, almost all radius searches find no neighbours other than the search point itself. Setting the search radius to 0.1 seems more appropriate, and it should probably be even higher. I am not sure what good settings for the Gaussian kernel are.
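
A quick way to sanity-check a candidate radius before running the convolution is to count the neighbours it finds; just a sketch, again assuming the same PointT and cloudptr as in your snippet:

#include <pcl/search/kdtree.h>
#include <iostream>

// Count how many neighbours a candidate radius finds on average, so the
// Gaussian kernel actually has something to smooth over.
pcl::search::KdTree<PointT> tree;
tree.setInputCloud(cloudptr);
const double radius = 0.1; // candidate value to evaluate
pcl::Indices nn_indices;
std::vector<float> nn_sqr_dists;
std::size_t total = 0;
for (const auto& point : *cloudptr)
  total += tree.radiusSearch(point, radius, nn_indices, nn_sqr_dists);
std::cout << "average neighbours per point at radius " << radius << ": "
          << static_cast<double>(total) / cloudptr->size() << '\n';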

mvieth (Member) commented Oct 20, 2024

Convolution3D now uses a dynamic schedule which should result in a better work balancing between the threads/cores: #6155
I hope that this leads to the desired behaviour on your computer, but either way I don't see anything else we can do on the PCL side.

mvieth closed this as completed on Oct 20, 2024
mvieth added the module: filters label and removed the status: triage label on Oct 20, 2024