
[Discussion] Poor scalability by thread number in PF #6

Open · facontidavide opened this issue Oct 21, 2019 · 2 comments
@facontidavide (Contributor)

This is just brainstorming, not really an "issue". You don't need to "solve" it; it is just an open discussion between nerds :)

I noticed that the PF SLAM scales quite poorly with the number of threads.

For instance, moving from 4 threads to 8 increases performance by only 50%. Note that the profiler still says we are using 100% of all 8 CPUs!

I do know that there is no such thing as perfect scalability, but in this case I think there "might" be a bottleneck somewhere.

I inspected the code and I couldn't find any mutex or potential false sharing, but of course I haven't done an exhaustive search.
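
In case it helps the search, here is a minimal sketch, not taken from this repo, of what false sharing typically looks like; it can hide without any explicit mutex:

```cpp
// False sharing: per-thread counters packed next to each other land on the
// same 64-byte cache line, so a write by one thread invalidates that line
// in every other core's cache, even though no element is logically shared.
struct Bad {
    long counters[8];        // e.g. 8 worker threads hammering one cache line
};

// The usual fix: align each per-thread slot to its own cache line.
struct alignas(64) PaddedSlot {
    long value;
};

struct Good {
    PaddedSlot counters[8];  // 8 threads, 8 independent cache lines
};
```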

@eupedrosa (Member)

I have an image that can help the discussion:

[Figure: measured multi-threading speedup vs. thread count (mt_speedup.png)]
The number of particles is 30.

In my opinion there are a few things that can explain this behavior:

  • Multi-threading does not speed up the full execution path. It parallelizes scan matching and ray integration (i.e. mapping), but normalizing the weights for resampling is a sequential step. Thus doubling the threads does not provide a ~2x speedup (see the Amdahl's law sketch after this list).

  • More threads can result in an execution penalty from managing the thread pool. From the image you can see that the speedup is asymptotic; adding threads can even degrade performance.

  • Each particle has a map with implicit data sharing (Copy-On-Write). Writing to a map can therefore trigger a concurrent-access sequence: mutex lock -> duplicate data -> mutex unlock. The more data is shared between particles, the more often this happens (a sketch of this write path follows the next comment below).

  • CPU affinity? If I am not mistaken, the Linux kernel may migrate a thread to a different logical core when scheduling it to run. This can result in cache misses.
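
To put a number on the first two points, this is a minimal Amdahl's law sketch, not code from this repo; the sequential fraction `s = 0.15` is a hypothetical value for illustration, not a measurement.

```cpp
// Amdahl's law: if a fraction `s` of each PF iteration is sequential
// (e.g. weight normalization and resampling), the best possible speedup
// with N threads is 1 / (s + (1 - s) / N), regardless of core count.
#include <cstdio>

int main() {
    const double s = 0.15;  // hypothetical sequential fraction
    for (int n : {1, 2, 4, 8, 16}) {
        const double speedup = 1.0 / (s + (1.0 - s) / n);
        std::printf("threads=%2d  ideal speedup=%.2fx\n", n, speedup);
    }
    return 0;
}
```

With s = 0.15 the ideal gain from 4 to 8 threads is only about 3.90x / 2.76x ≈ 1.4x, the same ballpark as the ~50% reported above; the actual sequential fraction would have to be profiled.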

@facontidavide (Contributor, Author)

I have the feeling that it is mostly related to point 3, but I might be wrong.

Anyway, the performance gain decreases rapidly above 4 threads.
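
To make point 3 concrete, here is a minimal sketch of that Copy-On-Write write path; the types and names are illustrative assumptions, not the actual lama API:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Illustrative stand-in for a map patch shared between particles.
struct MapPatch {
    std::vector<float> cells;
};

// Copy-On-Write handle: each particle (thread) owns one of these, but the
// underlying MapPatch may still be shared with other particles.
class CowPatch {
public:
    explicit CowPatch(std::shared_ptr<MapPatch> p) : data_(std::move(p)) {}

    void write(std::size_t idx, float value) {
        // The "mutex lock -> duplicate data -> mutex unlock" step from
        // point 3: the first write to a shared patch must clone it, and
        // the clone is serialized by a mutex.
        if (data_.use_count() > 1) {
            std::lock_guard<std::mutex> lock(clone_mutex_);
            data_ = std::make_shared<MapPatch>(*data_);  // deep copy
        }
        data_->cells[idx] = value;  // private copy now, safe to mutate
    }

private:
    std::shared_ptr<MapPatch> data_;
    static std::mutex clone_mutex_;  // shared: a contention point under load
};
std::mutex CowPatch::clone_mutex_;
```

Right after resampling, many surviving particles share the same maps, so writer threads all funnel through that clone step at once; that would match the gains flattening above 4 threads.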

MatjazBostic pushed a commit to UbiquityRobotics/lama_core that referenced this issue on Oct 1, 2024: [Hybrid SLAM] Add support to mapping pause and custom map setting